EventStore crashed while running on Azure Windows 2012 as a background process. Attached are the error log and stats log.
The last few records in the error log:
[PID:03152:013 2013.09.30 21:46:26.223 ERROR QueuedHandlerMRES ] Error while processing message EventStore.Projections.Core.Messages.ProjectionManagementMessage+Reset in queued handler 'Projections Master'.
System.NotSupportedException: Specified method is not supported.
at EventStore.Projections.Core.Services.Management.ManagedProjection.Stop(Action completed) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Projections.Core\Services\Management\ManagedProjection.cs:line 806
at EventStore.Projections.Core.Services.Management.ManagedProjection.Handle(Reset message) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Projections.Core\Services\Management\ManagedProjection.cs:line 331
at EventStore.Projections.Core.Services.Management.ProjectionManager.Handle(Reset message) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Projections.Core\Services\Management\ProjectionManager.cs:line 320
at EventStore.Core.Bus.MessageHandler`1.TryHandle(Message message) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Core\Bus\MessageHandler.cs:line 60
at EventStore.Core.Bus.InMemoryBus.Publish(Message message) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Core\Bus\InMemoryBus.cs:line 350
at EventStore.Core.Bus.QueuedHandlerMRES.ReadFromQueue(Object o) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Core\Bus\QueuedHandlerMRES.cs:line 141
[PID:03152:009 2013.09.30 21:52:40.524 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 7253ms. Q: 0/22003.
[PID:03152:009 2013.09.30 21:53:25.053 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 8898ms. Q: 0/12217.
[PID:03152:009 2013.09.30 23:05:05.283 ERROR TFChunk ] CACHING FAILED due to OutOfMemory exception in TFChunk #2-2 at chunk-000002.000000.
Records were being entered into the trace/debug/info log for about 20 minutes after the last error:
[PID:03152:011 2013.09.30 23:24:23.936 DEBUG TwoPhaseRequestManag] IDEMPOTENT WRITE TO STREAM ClientCorrelationID 1263e67c-014d-4a28-b0dc-e9935743652d, EventStreamId: $projections-scrapertest-checkpoint, CorrelationId: 1263e67c-014d-4a28-b0dc-e9935743652d, FirstEventNumber: 351, LastEventNumber: 351.
[PID:03152:014 2013.09.30 23:24:23.936 TRACE CoreProjectionCheckp] Checkpoint has be written for projection scrapertest at sequence number 351 (current)
[PID:03152:011 2013.09.30 23:24:23.953 DEBUG TwoPhaseRequestManag] IDEMPOTENT WRITE TO STREAM ClientCorrelationID bb8422ab-353d-45b2-a18b-a9fb35f42d95, EventStreamId: $projections-$users-checkpoint, CorrelationId: bb8422ab-353d-45b2-a18b-a9fb35f42d95, FirstEventNumber: 57, LastEventNumber: 57.
[PID:03152:015 2013.09.30 23:24:23.953 TRACE CoreProjectionCheckp] Checkpoint has be written for projection $users at sequence number 57 (current)
[PID:03152:015 2013.09.30 23:24:23.953 TRACE CoreProjectionCheckp] Writing checkpoint for $users at C:612036926/P:611564055; $UserCreated: -1; with expected version number 57
[PID:03152:015 2013.09.30 23:24:24.030 TRACE CoreProjectionCheckp] Checkpoint has be written for projection $users at sequence number 58 (current)
[PID:03152:031 2013.09.30 23:26:45.048 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #3]: ReadAllEventsForward - 421ms. Q: 0/0.
As Greg explained, ES didn't crash because caching failed (that error is caught and the chunk simply continues to run uncached). But looking at the other ---!!! VERY SLOW QUEUE MSG messages:
The last number shows how many messages are in the queue (in this case the storage writer's queue) at the time the current message completes. Those numbers are huge, and they are probably the reason why you eventually exhaust all the memory. You are heavily overloading the node, which is not good. This could be because you don't limit the number of active writes at the server (which is configurable in the ClientAPI and defaults to 2000), and/or because you are writing a lot of data in a single batch, which is not optimal either. Please consider optimizing your write pattern so the system overall is more stable and less resource-heavy. I bet that if you look at the usual (non -err) log, you'll see a lot of SLOW MSG messages with pretty high numbers as well. VERY SLOW MSG is reported as an error if a single write takes more than 7 seconds (which happens only if the HDD dies, or you write huge events, or a lot of them in a single write).
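For reference, a minimal sketch of capping in-flight writes on the client side, assuming a recent version of the .NET ClientAPI; the endpoint and the limit values here are illustrative assumptions, not settings taken from this thread:

    using System.Net;
    using EventStore.ClientAPI;

    class ThrottledConnectionFactory
    {
        public static IEventStoreConnection Create()
        {
            // Cap how many operations the client keeps in flight and how many it
            // queues, so bursts back-pressure the writers instead of piling up in
            // the server's StorageWriterQueue. Values below are illustrative only.
            var settings = ConnectionSettings.Create()
                .LimitConcurrentOperationsTo(500)
                .LimitOperationsQueueTo(2000)
                .KeepReconnecting()
                .Build();

            var connection = EventStoreConnection.Create(
                settings, new IPEndPoint(IPAddress.Loopback, 1113));
            connection.ConnectAsync().Wait();
            return connection;
        }
    }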
My guess is that this has to do with the underlying storage of the Azure background node. I imagine it is networked storage? What size node are you running on (some sizes are limited)?
We were running a Large Azure VM (7 GB), writing data to the C drive and logs to the D drive - not currently using an attached VHD. These are the EventStore command line args:
Each individual event is quite small, around 200 bytes. According to our estimates, the upper bound of what we can be writing is about 600 events per second, all into a single stream. We also have a projection off of this stream which simply counts the number of occurrences of each event type. We will try disabling the projection and running it on a separate node instead. I have the debug log (~10 MB) if that would be useful.
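For what it's worth, here is a minimal sketch of the batched write pattern suggested above, appending small JSON events to a single stream via the .NET ClientAPI; the stream name, event type, and batch size are illustrative assumptions:

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Threading.Tasks;
    using EventStore.ClientAPI;

    class BatchedWriter
    {
        // Append ~200-byte events in modest batches rather than one call per
        // event; "scrapertest", "ItemScraped" and the batch size of 100 are
        // assumptions for illustration only.
        public static async Task WriteAsync(
            IEventStoreConnection connection, IEnumerable<string> jsonPayloads)
        {
            var batch = new List<EventData>();
            foreach (var payload in jsonPayloads)
            {
                var data = Encoding.UTF8.GetBytes(payload);
                batch.Add(new EventData(Guid.NewGuid(), "ItemScraped", true, data, null));

                if (batch.Count == 100)
                {
                    await connection.AppendToStreamAsync("scrapertest", ExpectedVersion.Any, batch);
                    batch.Clear();
                }
            }
            if (batch.Count > 0)
                await connection.AppendToStreamAsync("scrapertest", ExpectedVersion.Any, batch);
        }
    }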
Something sounds off here. We have benchmarked ES easily at 4-5k transactions/second on an Azure Small node. Also, with ES using around 700 MB of memory, you should not be getting OOM exceptions on a Large node.
The projection should not be an issue with such a low event throughput. Can you describe a bit how the setup was done? Is this IaaS?
How long did you run the benchmark for? In our case the error occurred after 7 hours. The EventStore node is running on an IaaS VM, and there are five PaaS worker roles writing to the EventStore within the same VNet.
Lev, have you tried running ES with a dedicated VHD (read caching only)?
By default, disk C is shared with the OS (I would assume it is used for paging as well) and it also has write caching enabled. The latter does not work well with ES, AFAIK.
I would expect it to be faster in benchmarks, however.
Actually wait!!! @Lev how big is your data when this happens? Could you be almost out of space on that drive where writes take a very long time? Also what does the memory look like on ES when this is happening? On the box?
@Greg - the disk is nowhere near out of space. Each individual event is around 200 bytes and the data directory is around 800 MB. According to the stats stream, the memory was fine as well - nothing out of the ordinary around the time of the error. The stats were collected for 40 minutes after the "CACHING FAILED due to OutOfMemory exception" error. The last event was written about 20 minutes after the error.
Also, looking at the last written events, there were 1,250 written in the last second, based on the time displayed in the web interface. The writing seems to come in bursts. Is this due to issues with the Azure C drive, or an issue with the writers themselves (being investigated now)?
The writing in bursts is likely related to the storage writer taking 7 seconds (!) to do an fsync. I am not sure what may be up with those C drives, but perhaps they are not good for running on, and using only a VHD is better.
Regarding "nothing out of the ordinary around the time of the error. The stats were collected for 40 minutes after the 'CACHING FAILED due to OutOfMemory exception' error. The last event was written about 20 minutes after the error":
This is an unmanaged malloc failing for 256 MB. The error is handled just fine internally (it's only for a cache; if it fails, the chunk will just keep working and not be cached). The real question here is why a 256 MB malloc would be failing on the system. I can only guess heavy memory pressure at the time.
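To illustrate the behaviour Greg describes (a simplified sketch, not EventStore's actual TFChunk code): the unmanaged allocation for the chunk cache is wrapped in a try/catch, and an OutOfMemoryException just means the chunk stays uncached.

    using System;
    using System.Runtime.InteropServices;

    class ChunkCacheSketch
    {
        // Illustrative only: try an unmanaged allocation for the cache and fall
        // back to running uncached if the malloc fails.
        public static IntPtr TryAllocateCache(int sizeBytes)
        {
            try
            {
                return Marshal.AllocHGlobal(sizeBytes); // unmanaged malloc
            }
            catch (OutOfMemoryException)
            {
                Console.Error.WriteLine("CACHING FAILED due to OutOfMemory exception.");
                return IntPtr.Zero; // chunk keeps serving reads, just uncached
            }
        }
    }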
It's quite simple, really: just what was recommended by Rinat, Chris, and Greg. In this particular instance we have a Large Azure VM with an attached VHD using ReadOnly host caching. The drive is created with the following Azure PowerShell command: