EventStore crashed while running on Azure Windows 2012 as a background process. Attached are the error log and stats log.
The last few records in the error log:
[PID:03152:013 2013.09.30 21:46:26.223 ERROR QueuedHandlerMRES ] Error while processing message EventStore.Projections.Core.Messages.ProjectionManagementMessage+Reset in queued handler 'Projections Master'.
System.NotSupportedException: Specified method is not supported.
at EventStore.Projections.Core.Services.Management.ManagedProjection.Stop(Action completed) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Projections.Core\Services\Management\ManagedProjection.cs:line 806
at EventStore.Projections.Core.Services.Management.ManagedProjection.Handle(Reset message) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Projections.Core\Services\Management\ManagedProjection.cs:line 331
at EventStore.Projections.Core.Services.Management.ProjectionManager.Handle(Reset message) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Projections.Core\Services\Management\ProjectionManager.cs:line 320
at EventStore.Core.Bus.MessageHandler`1.TryHandle(Message message) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Core\Bus\MessageHandler.cs:line 60
at EventStore.Core.Bus.InMemoryBus.Publish(Message message) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Core\Bus\InMemoryBus.cs:line 350
at EventStore.Core.Bus.QueuedHandlerMRES.ReadFromQueue(Object o) in c:\BuildAgent1\work\oss\windows\releasebuilds\src\EventStore\EventStore.Core\Bus\QueuedHandlerMRES.cs:line 141
[PID:03152:009 2013.09.30 21:52:40.524 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 7253ms. Q: 0/22003.
[PID:03152:009 2013.09.30 21:53:25.053 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 8898ms. Q: 0/12217.
[PID:03152:009 2013.09.30 23:05:05.283 ERROR TFChunk ] CACHING FAILED due to OutOfMemory exception in TFChunk #2-2 at chunk-000002.000000.
Records were being entered into the trace/debug/info log for about 20 minutes after the last error:
[PID:03152:011 2013.09.30 23:24:23.936 DEBUG TwoPhaseRequestManag] IDEMPOTENT WRITE TO STREAM ClientCorrelationID 1263e67c-014d-4a28-b0dc-e9935743652d, EventStreamId: $projections-scrapertest-checkpoint, CorrelationId: 1263e67c-014d-4a28-b0dc-e9935743652d, FirstEventNumber: 351, LastEventNumber: 351.
[PID:03152:014 2013.09.30 23:24:23.936 TRACE CoreProjectionCheckp] Checkpoint has be written for projection scrapertest at sequence number 351 (current)
[PID:03152:011 2013.09.30 23:24:23.953 DEBUG TwoPhaseRequestManag] IDEMPOTENT WRITE TO STREAM ClientCorrelationID bb8422ab-353d-45b2-a18b-a9fb35f42d95, EventStreamId: $projections-$users-checkpoint, CorrelationId: bb8422ab-353d-45b2-a18b-a9fb35f42d95, FirstEventNumber: 57, LastEventNumber: 57.
[PID:03152:015 2013.09.30 23:24:23.953 TRACE CoreProjectionCheckp] Checkpoint has be written for projection $users at sequence number 57 (current)
[PID:03152:015 2013.09.30 23:24:23.953 TRACE CoreProjectionCheckp] Writing checkpoint for $users at C:612036926/P:611564055; $UserCreated: -1; with expected version number 57
[PID:03152:015 2013.09.30 23:24:24.030 TRACE CoreProjectionCheckp] Checkpoint has be written for projection $users at sequence number 58 (current)
[PID:03152:031 2013.09.30 23:26:45.048 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #3]: ReadAllEventsForward - 421ms. Q: 0/0.
As Greg explained, ES didn't crash because caching failed (that error is caught and the chunk simply continues to run uncached). But looking at the other ---!!! VERY SLOW QUEUE MSG messages:
The last number shows how many messages are in the queue (in this case the storage writer's queue) at the time the current message completes. Those numbers are huge, and they are probably the reason why you eventually exhaust all the memory. You are heavily overloading the node, which is not good. This could be because you don't limit the number of active writes at the server (which is configurable in the ClientAPI and defaults to 2000), and/or because you are writing a lot of data in a single batch, which is not optimal either. Please consider optimizing your write pattern so the system overall is more stable and less resource-heavy. I bet that if you look at the usual (non -err) log, you'll see a lot of SLOW MSG messages with pretty high numbers as well. VERY SLOW MSG is reported as an error if a single write takes more than 7 seconds (which happens only if the HDD dies, or you write huge events, or a lot of them in a single write).
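For reference, a minimal sketch of capping in-flight writes on the client side, assuming a recent version of the .NET ClientAPI; the endpoint and the limit values here are illustrative assumptions, not settings taken from this thread:

    using System.Net;
    using EventStore.ClientAPI;

    class ThrottledConnectionFactory
    {
        public static IEventStoreConnection Create()
        {
            // Cap how many operations the client keeps in flight and how many it
            // queues, so bursts back-pressure the writers instead of piling up in
            // the server's StorageWriterQueue. Values below are illustrative only.
            var settings = ConnectionSettings.Create()
                .LimitConcurrentOperationsTo(500)
                .LimitOperationsQueueTo(2000)
                .KeepReconnecting()
                .Build();

            var connection = EventStoreConnection.Create(
                settings, new IPEndPoint(IPAddress.Loopback, 1113));
            connection.ConnectAsync().Wait();
            return connection;
        }
    }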
My guess is that this has to do with the underlying storage of the Azure background node. I imagine it is networked storage? What size node are you running on (some sizes are limited)?
We were running a Large Azure VM (7 GB), writing data to the C drive and logs to the D drive - not currently using an attached VHD. These are the EventStore command line args:
Each individual event is quite small, around 200 bytes. According to our estimates, the upper bound of what we can be writing is about 600 events per second, all into a single stream. We also have a projection off of this stream which simply counts the number of occurrences of each event type. We will try disabling the projection and running it on a separate node instead. I have the debug log (~10 MB) if that would be useful.
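For what it's worth, here is a minimal sketch of the batched write pattern suggested above, appending small JSON events to a single stream via the .NET ClientAPI; the stream name, event type, and batch size are illustrative assumptions:

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Threading.Tasks;
    using EventStore.ClientAPI;

    class BatchedWriter
    {
        // Append ~200-byte events in modest batches rather than one call per
        // event; "scrapertest", "ItemScraped" and the batch size of 100 are
        // assumptions for illustration only.
        public static async Task WriteAsync(
            IEventStoreConnection connection, IEnumerable<string> jsonPayloads)
        {
            var batch = new List<EventData>();
            foreach (var payload in jsonPayloads)
            {
                var data = Encoding.UTF8.GetBytes(payload);
                batch.Add(new EventData(Guid.NewGuid(), "ItemScraped", true, data, null));

                if (batch.Count == 100)
                {
                    await connection.AppendToStreamAsync("scrapertest", ExpectedVersion.Any, batch);
                    batch.Clear();
                }
            }
            if (batch.Count > 0)
                await connection.AppendToStreamAsync("scrapertest", ExpectedVersion.Any, batch);
        }
    }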
Something sounds off here. We have benchmarked ES easily at 4-5k transactions/second on an Azure Small node. Also, with ES using around 700 MB of memory, you should not be getting OOM exceptions on a Large node.
The projection should not be an issue with such a low event throughput. Can you describe a bit how the setup was done? Is this IaaS?
How long did you run the benchmark for? In our case the error occurred after 7 hours. The EventStore node is running on an IaaS VM, and there are five PaaS worker roles writing to the EventStore within the same VNet.
Lev, have you tried running ES with a dedicated VHD (read caching only)?
By default, disk C is shared with the OS (I would assume it is used for paging as well) and it also has write caching enabled. The latter does not work well with ES, AFAIK.
I would expect it to be faster in benchmarks, however.
Actually wait!!! @Lev how big is your data when this happens? Could you be almost out of space on that drive where writes take a very long time? Also what does the memory look like on ES when this is happening? On the box?
@Greg - the disk is nowhere near out of space. Each individual event is around 200 bytes and the data directory is around 800 MB. According to the stats stream, the memory was fine as well - nothing out of the ordinary around the time of the error. The stats were collected for 40 minutes after the "CACHING FAILED due to OutOfMemory exception" error. The last event was written about 20 minutes after the error.
Also, looking at the last written events, there were 1,250 written in the last second, based on the time displayed in the web interface. The writing seems to come in bursts. Is this due to issues with the Azure C drive, or an issue with the writers themselves (being investigated now)?
The writing in bursts is likely related to the storage writer taking 7 seconds (!) to do an fsync. I am not sure what may be up with those C drives, but perhaps they are not good for running on, and using only a VHD is better.
Regarding "nothing out of the ordinary around the time of the error. The stats were collected for 40 minutes after the 'CACHING FAILED due to OutOfMemory exception' error. The last event was written about 20 minutes after the error":
This is an unmanaged malloc failing for 256 MB. The error is handled just fine internally (it's only for a cache; if it fails, the chunk will just keep working and not be cached). The real question here is why a 256 MB malloc would be failing on the system. I can only guess heavy memory pressure at the time.
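To illustrate the behaviour Greg describes (a simplified sketch, not EventStore's actual TFChunk code): the unmanaged allocation for the chunk cache is wrapped in a try/catch, and an OutOfMemoryException just means the chunk stays uncached.

    using System;
    using System.Runtime.InteropServices;

    class ChunkCacheSketch
    {
        // Illustrative only: try an unmanaged allocation for the cache and fall
        // back to running uncached if the malloc fails.
        public static IntPtr TryAllocateCache(int sizeBytes)
        {
            try
            {
                return Marshal.AllocHGlobal(sizeBytes); // unmanaged malloc
            }
            catch (OutOfMemoryException)
            {
                Console.Error.WriteLine("CACHING FAILED due to OutOfMemory exception.");
                return IntPtr.Zero; // chunk keeps serving reads, just uncached
            }
        }
    }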
It's quite simple, really: just what was recommended by Rinat, Chris, and Greg. In this particular instance we have a Large Azure VM with an attached VHD using ReadOnly host caching. The drive is created with the following Azure PowerShell command: