Question about utilization of internal queues

I’ve been running an operation involving EventStore for about a week and have some questions about performance.

Setup

  • Azure ExtraLarge VM (14 GB memory, 8 logical CPUs)

  • 6 x 1 TB VHDs pooled into one 6 TB volume (RAID 0) (no caching, since read caching can only be used with up to 4 pooled drives)

  • EventStore arguments: --run-projections=ALL --skip-db-verify --db=(path to pooled volume) --log=(path to C drive)

Workload

Event streams are written to in batches on a 1-second interval. One stream currently contains about 60 million events, and roughly every other event in it corresponds to an additional event in a new stream, so there are also about 30 million single-event streams (not enforced through metadata). There is a projection over the first stream which measures events/sec both over the whole stream and in sliding 100-event windows, emitting snapshots to a separate stream. There is also a second, custom projection which measures stats for the 30 million streams by building on $by_category, but both it and $by_category have been disabled out of caution.
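For concreteness, a minimal sketch of the rate-measuring projection follows (TypeScript, mirroring the shape of EventStore's JavaScript projection API); the stream names, the timestamp field on the event body, and the snapshot layout are assumptions, not the actual code:

    // Sketch only: fromStream/emit are EventStore projection primitives,
    // declared here so the file type-checks standalone.
    declare function fromStream(stream: string): {
      when(handlers: Record<string, (state: any, ev: any) => any>): void;
    };
    declare function emit(stream: string, eventType: string, body: object): void;

    fromStream('source-stream').when({ // 'source-stream' is a placeholder name
      $init: () => ({ total: 0, firstTs: 0, window: [] as number[] }),
      $any: (s, ev) => {
        const ts: number = ev.body.timestamp; // assumes events carry a ms timestamp
        if (s.total === 0) s.firstTs = ts;
        s.total++;
        s.window.push(ts); // keep the timestamps of the last 100 events
        if (s.window.length > 100) s.window.shift();
        if (s.total % 100 === 0) {
          emit('rate-snapshots', 'RateSnapshot', { // placeholder snapshot stream
            overallPerSec: s.total / Math.max((ts - s.firstTs) / 1000, 1),
            windowPerSec: s.window.length / Math.max((ts - s.window[0]) / 1000, 1),
          });
        }
        return s;
      },
    });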

Utilization

CPU ~13%

Memory ~200 MB available (98% used)

Disk Read ~ 250 KB/sec / Write ~ 550 KB/sec

Ethernet Send ~ 700 Kbps / Receive ~ 2 Mbps

Projection events/sec ~ 200

Data ~ 60 GB

Queues Snapshot

Snapshot taken at: Mon Nov 04 2013 11:26:33 GMT-0500 (Eastern Standard Time)
Name                              Length   Avg, items/s   Avg Time, ms/item   Busy, %      Peak   Max Peak   Total processed                 Current/Last message
==============================  ========  =============  ==================  ========  ========  =========  ================  ===================================
MainQueue                              0            486               0.003       0.1       144       3127         204039062               <none> / EventCommited
MonitoringQueue                        0              0              14.696       1.3         0          1             18689        GetFreshStats / GetFreshStats
Projection Core                        1           2177               0.166      25.1       252       3654         494824070                                  n/a
Projection Core #0                     0            247               0.045       1.1       250       3654         125841827             <none> / EventReaderIdle
Projection Core #1                     0            282               0.005       0.1       148        998         218085111             <none> / EventReaderIdle
Projection Core #2                     1           1648               0.449      74.0       252        726         150897132                  CoreTick / CoreTick
Projections Master                     0             47               0.016       0.1         8        313          38324739            <none> / StatisticsReport
Storage Chaser                         0            804               0.520      41.8         0          0         245738715       <none> / ChaserCheckpointFlush
StorageReaderQueue                     0             30              31.344      25.5         1       2014          20665597                                  n/a
StorageReaderQueue #1                  0              7              19.048      13.6         0        739           5166400        <none> / ReadAllEventsForward
StorageReaderQueue #2                  0              7              60.534      43.4         1        510           5166399  ReadStreamEventsForward / ReadAllEventsForward
StorageReaderQueue #3                  0              7               0.253       0.2         0       2014           5166399        <none> / ReadAllEventsForward
StorageReaderQueue #4                  0              9              45.541      44.9         1       1748           5166399        <none> / ReadAllEventsForward
StorageWriterQueue                     0             20              48.542     100.0         5      30237          23094108            WriteCommit / WriteCommit
Subscriptions                          0            385               0.000       0.0       177       1062          98795539               <none> / EventCommited
Timer                                 13             26               0.019       0.1        13         27          19402931       <none> / ExecuteScheduledTasks
Workers                                0             50               0.083       0.1         2          7          24943432                                  n/a
Worker #1                              0              8               0.084       0.1         2          7           6434333                    <none> / HttpSend
Worker #2                              0              9               0.061       0.1         1          5           4760110                    <none> / HttpSend
Worker #3                              0             12               0.099       0.1         1          6           5127474  <none> / IncomingHttpRequestMessage
Worker #4                              0             11               0.078       0.1         2          5           3924167  <none> / AuthenticatedHttpRequestMessage
Worker #5                              0             10               0.095       0.1         1          5           4697348                    <none> / HttpSend

The StorageWriterQueue has been red, and a few errors indicating slow queues have been logged since Thursday:

[PID:03612:013 2013.10.31 20:00:34.827 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WriteCommit - 8549ms. Q: 0/13.

[PID:03612:013 2013.10.31 21:06:52.669 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WriteCommit - 15243ms. Q: 0/27.

[PID:03612:013 2013.11.01 03:23:30.090 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WriteCommit - 13675ms. Q: 0/25.

[PID:03612:013 2013.11.02 06:16:48.942 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WriteCommit - 7203ms. Q: 0/18.

[PID:03612:031 2013.11.02 07:25:00.471 ERROR QueuedHandlerThreadP] ---!!! VERY SLOW QUEUE MSG [StorageReaderQueue #4]: CheckStreamAccess - 33254ms. Q: 0/22.

[PID:03612:036 2013.11.02 07:25:00.471 ERROR QueuedHandlerThreadP] ---!!! VERY SLOW QUEUE MSG [StorageReaderQueue #3]: CheckStreamAccess - 24844ms. Q: 0/12.

[PID:03612:034 2013.11.02 07:25:00.471 ERROR QueuedHandlerThreadP] ---!!! VERY SLOW QUEUE MSG [StorageReaderQueue #1]: CheckStreamAccess - 27906ms. Q: 0/13.

[PID:03612:025 2013.11.02 07:25:00.471 ERROR QueuedHandlerThreadP] ---!!! VERY SLOW QUEUE MSG [StorageReaderQueue #2]: CheckStreamAccess - 30955ms. Q: 0/16.

[PID:03612:013 2013.11.03 10:14:19.014 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 9393ms. Q: 0/24.

No other errors have been logged so far.

Are these stats something to be concerned about? Would it be acceptable to enable the $by_category projection as well as the second, custom projection? The StorageWriterQueue is consistently red and 99% busy.

Being red is not an issue, as it’s adaptive. If, for instance, it takes 100 ms to fsync, then 10 events per second could hit 100% utilization.
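As a back-of-the-envelope check, that claim can be read straight off the queue table above: Busy % is roughly avg items/s times avg ms/item, i.e. milliseconds of work per second of wall time. A quick sketch:

    // busy% = items/s * ms/item / 1000 * 100 = items/s * ms/item / 10
    const busyPercent = (itemsPerSec: number, msPerItem: number): number =>
      (itemsPerSec * msPerItem) / 10;

    console.log(busyPercent(10, 100));    // 100: ten 100 ms fsyncs fill a whole second
    console.log(busyPercent(20, 48.542)); // ~97: close to the StorageWriterQueue row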

When you say memory stats, are those for the box or the process? Windows will cache files aggressively but will also release them under pressure. What are the stats for the process?

Greg

Memory is for the whole system, as is Ethernet. The EventStore process itself uses ~700 MB of memory.

"If for instance it takes 100 ms to fsync then 10 events per second could hit 100% utilization."

How can the queue stats be used to make this interpretation? As I understand it, an EventStore write consists of a prepare and a commit. Does that mean the fsync time of an event is the sum of the fsyncs of both the prepare and the commit?

Depends how you write. Are you using transactional writes, e.g. begintransaction?

For memory, this will be bypassed soon, but I wouldn't worry too much; it's just file caching.

Greg

No transactions.
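For reference, a plain batched append over the HTTP API looks roughly like the sketch below (TypeScript, Node 18+); the host, stream name, and event type are placeholders, and the whole batch goes in a single POST:

    // Non-transactional batched append via EventStore's HTTP API (sketch).
    import { randomUUID } from 'node:crypto';

    async function appendBatch(stream: string, batch: object[]): Promise<void> {
      const events = batch.map((data) => ({
        eventId: randomUUID(),       // each event needs a unique id
        eventType: 'SampleRecorded', // placeholder event type
        data,
      }));
      const res = await fetch(`http://127.0.0.1:2113/streams/${stream}`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/vnd.eventstore.events+json' },
        body: JSON.stringify(events),
      });
      if (res.status !== 201) throw new Error(`Append failed: HTTP ${res.status}`);
    }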

Understood about memory use; it doesn’t seem to cause problems. Though if the system only has ~300 MB available, can that affect the performance of other processes initially, or does Windows do a good job of reclaiming?

Last I checked, Windows is pretty good at giving it up. The philosophy being: why leave the RAM unused right now when we can make things faster by using it?

With regard to the queues, what will happen if, say, the StorageWriterQueue remains at 100% utilization for extended periods of time? Out of memory, or does it begin to push back? In this scenario the queue seems to remain at high utilization, but disk utilization for the process is around 1 MB/sec. Does this imply that maximum throughput is being reached?

Not at all; we use an adaptive algorithm to balance between throughput and latency. The cost is in the fsync: we try to balance low latency against a given throughput. E.g. at one write per 100 ms we will fsync each one and get low latency; as throughput goes up, we begin batching.
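A toy sketch of that group-commit idea (not EventStore's actual code): writes that arrive while an fsync is in flight are flushed together, so the per-event fsync cost shrinks as load rises.

    // Toy group commit: one fsync per batch instead of one per write.
    type Write = { data: Uint8Array; done: () => void };

    declare function writeAndFsync(chunks: Uint8Array[]): Promise<void>; // assumed I/O primitive

    const pending: Write[] = [];
    let flushing = false;

    async function submit(w: Write): Promise<void> {
      pending.push(w);
      if (flushing) return; // the in-flight flush loop will pick this write up
      flushing = true;
      while (pending.length > 0) {
        const batch = pending.splice(0, pending.length); // take everything queued so far
        await writeAndFsync(batch.map((b) => b.data));   // one fsync amortized over the batch
        for (const b of batch) b.done();
      }
      flushing = false;
    }

At one write per 100 ms each batch holds a single event, so every write gets its own fsync and low latency; once writes arrive faster than an fsync completes, batches grow and throughput scales.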

I’ll be in NYC and Chicago in 8 weeks. Where will you be?

I’ll be in Denver from Wednesday, January 29th through the weekend and in DC from Friday, February 28th through the weekend, but other than that I’ll be in the NY area, specifically Montclair, NJ.

Good, we are in South Africa then… Let’s try to do an NYC meetup?

Cool, let’s. What dates? Any thoughts on topics? A coworker is a co-organizer of DDD-NYC (together with Steve Bohen), so he could help with a venue.