Eventstore cache out of memory issue

Hi all:

I found an out-of-memory error using Event Store in my prod env:

CACHING FAILED due to OutOfMemory exception in TFChunk

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.

at EventStore.Projections.Core.ProjectionCoreWorkersNode.CreateCoreWorkers(StandardComponents standardComponents, ProjectionsStandardComponents projectionsStandardComponents)

at EventStore.Projections.Core.ProjectionsSubsystem.Register(StandardComponents standardComponents)

at EventStore.Core.ClusterVNode..ctor(TFChunkDb db, ClusterVNodeSettings vNodeSettings, IGossipSeedSource gossipSeedSource, InfoController infoController, ISubsystem[] subsystems)

at EventStore.ClusterNode.Program.Create(ClusterNodeOptions opts)

at EventStore.Core.ProgramBase`1.Run(String[] args)

I'm using Event Store version 3.0.5.

And I found a command-line argument, "ChunksCacheSize". Should I use it? I can see the default value is 536871424, but what is the unit of this number?

Thank you.

What OS are you running on, and what is the memory profile of that machine?

The default chunk size is 256 MB (and it caches 2 chunks by default).
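For what it's worth, the default value lines up exactly with bytes as the unit. A quick sanity check (the 256-byte per-chunk overhead is my assumption about header/footer bookkeeping, not something stated in the docs):

```python
CHUNK_SIZE = 256 * 1024 * 1024   # 256 MiB default chunk size
CACHED_CHUNKS = 2                # chunks cached by default
OVERHEAD = 256                   # assumed per-chunk header/footer bytes

default = CACHED_CHUNKS * (CHUNK_SIZE + OVERHEAD)
print(default)  # 536871424, matching the documented default
```

So ChunksCacheSize appears to be specified in plain bytes.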

My OS is Windows Server 2008 R2

There is a command-line argument in the Event Store documentation:

-ChunksCacheSize
–chunks-cache-size=VALUE
CHUNKSCACHESIZE
ChunksCacheSize
The amount of unmanaged memory to use for caching chunks. (Default: 536871424)

Should I use it?

What is the default value unit?

On Monday, March 21, 2016 at 5:50:12 PM UTC+8, Greg Young wrote:

What is the memory utilization on the box?

We commonly see the Windows file cache eating up space and not releasing it in time. You can limit the file cache on the machine using tools like Sysinternals CacheSet: https://technet.microsoft.com/en-us/sysinternals/bb897561

I also see my domain instances using a lot of memory (5 GB+) after a few hours of work. Eventually they crash in the same way as the OP's. I was working on running a memory profiler to narrow it down; preliminary results show something in ES called LRUCache creating strings which are never released, according to the profiler, but I am not certain of anything yet.

I doubt you have 5 GB of strings in the LRU cache. My guess is it's 1) unmanaged memory (512 MB) and 2) file cache mappings.

No, definitely not 5 GB of LRU caches :stuck_out_tongue:

I think it's the default MaxPushQueueSize: if events are fairly large (500 KB+), 10k of these backing up would easily be 5 GB.
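A rough sketch of that estimate (the 10k queue depth and 500 KB event size are the figures from this thread, not values I have verified against the client's defaults):

```python
MAX_PUSH_QUEUE_SIZE = 10_000    # assumed default push queue depth
EVENT_SIZE = 500 * 1024         # 500 KB per event

backlog_bytes = MAX_PUSH_QUEUE_SIZE * EVENT_SIZE
print(backlog_bytes / 1024**3)  # ~4.77 GiB held in memory by a full queue
```

Which is right in the neighborhood of the 5 GB being observed.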

Maybe both of us are having events back up in the queue; I know I've been having issues with event processing speed for a while. Changing the max queue size is available for stream catch-up subscriptions. Can you add it to the all-stream catch-up subscriber too?

Why do you have 500 KB events? I would imagine processing speeds might be a problem when 200 events/second is 100 MB/second over the network.
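The arithmetic on that rate checks out; the small gap from a round 100 comes down to binary vs. decimal megabytes:

```python
EVENT_SIZE = 500 * 1024      # 500 KB per event
RATE = 200                   # events per second

throughput = RATE * EVENT_SIZE
print(throughput / 1024**2)  # ~97.66 MiB/s, roughly the 100 MB/s cited
```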

They are bulk events that appear 1-2 times a second; each event contains about 300 data rows.
Bulking them together into one event helps event handlers process all those things together. I've thought about dividing them up into 300 separate events, but doing that would almost double the amount of data flowing through ES.

Either way, I'm going to generate at least 1 MB a second when doing bulk importing; the only difference is how many events get queued up in the subscribers, and I'd rather have them shut down due to a processing queue overflow than an obscure out-of-memory exception.

It's already supported; it just needs to be exposed.
https://github.com/EventStore/EventStore/blob/release-v3.6.0/src/EventStore.ClientAPI/EventStoreCatchUpSubscription.cs#L118
I'll PR it in the morning.

https://github.com/EventStore/EventStore/blob/release-v3.6.0/src/EventStore.ClientAPI/IEventStoreConnection.cs#L254
vs
https://github.com/EventStore/EventStore/blob/release-v3.6.0/src/EventStore.ClientAPI/IEventStoreConnection.cs#L362

How are you changing the maxQueueSize now?

I'm not, that's what I want to do. The stream subscription function has the argument to modify that max but the all subscriber doesn't

Which stream subscription function?

https://github.com/EventStore/EventStore/blob/release-v3.6.0/src/EventStore.ClientAPI/EventStoreCatchUpSubscription.cs#L118

EventStoreCatchUpSubscription

But it's missing on
EventStoreAllCatchUpSubscription

I'm traveling today can only use phone so I apologize if the formatting or links are wrong.

But they aren't exposed (either of them)

https://github.com/EventStore/EventStore/blob/release-v3.6.0/src/EventStore.ClientAPI/EventStoreCatchUpSubscription.cs#L523
https://github.com/EventStore/EventStore/blob/release-v3.6.0/src/EventStore.ClientAPI/EventStoreCatchUpSubscription.cs#L419

You linked to the base class, which has it:

https://github.com/EventStore/EventStore/blob/release-v3.6.0/src/EventStore.ClientAPI/EventStoreCatchUpSubscription.cs#L118

So I am curious how you are using it?!

Greg

I'm not, lol.
I want to: to prevent out-of-memory exceptions, I'd rather have processing queue overflow exceptions.

I see now...

https://github.com/EventStore/EventStore/blob/release-v3.5.0/src/EventStore.ClientAPI/EventStoreCatchUpSubscription.cs#L115

It is exposed in 3.5.0; there was a big refactor in 3.6.0 here and the parameter was removed.

No, I take that back: it isn't exposed on either one in 3.5.0 either...

Are you actually using it, or did you just see it in the XML docs, etc.?

It is defined on the base class but not exposed through either derived class.

Here you go https://github.com/EventStore/EventStore/pull/863

I am still having issues here with cached TFChunks, unfortunately.
At first I thought it was my app growing too large for ES, since I was using the embedded store. So I switched to running standalone nodes and got similar results.

The store only grows to a max of 700 MB in memory, oscillating between 300 and 700, but after a few hours I start seeing:

[PID:02880:011 2016.04.01 04:30:50.482 ERROR TFChunk ] CACHING FAILED due to OutOfMemory exception in TFChunk #2-2 (chunk-000002.000000).

After 4-5 of these the store just closes.

I figured the store was running out of unmanaged memory, so I added some command-line arguments. Here is the full list:

C:\eventstore\EventStore.ClusterNode.exe --int-ip 172.16.0.200 --ext-ip 172.16.0.200 --chunks-cache-size=1073741824 --log=C:\logs\ --db=C:\data --cached-chunks=2 --int-tcp-port=3113 --ext-tcp-port=3112 --int-http-port=3110 --ext-http-port=3111 --int-http-prefixes=http://:3110/ --ext-http-prefixes=http://:3111/ --cluster-size=5 --discover-via-dns=false --gossip-timeout-ms=2000 --gossip-seed=172.16.0.200:3110,172.16.0.201:3110,172.16.0.202:3110,172.16.0.203:3110,172.16.0.204:3110

I set the cache size to 1 GB, up from 512 MB, hoping the chunks would have enough space to be swapped in and out, but no dice. I do have a suspect, however: now that I have a store log with trace enabled, I can see the following:

[PID:02880:008 2016.04.01 04:55:02.039 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #1]: ReadAllEventsForward - 218ms. Q: 0/3.

[PID:02880:006 2016.04.01 04:55:02.039 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #3]: ReadAllEventsForward - 234ms. Q: 0/4.

[PID:02880:011 2016.04.01 04:55:02.992 TRACE TFChunk ] CACHED TFChunk #3-3 (chunk-000003.000000) in 00:00:00.0003341.

[PID:02880:006 2016.04.01 04:55:03.042 TRACE TFChunk ] UNCACHED TFChunk #1-1 (chunk-000001.000000).

[PID:02880:013 2016.04.01 04:55:03.117 INFO MasterReplicationSer] Subscribed replica [172.16.0.204:3113,S:49a8a465-1fc0-450f-8d01-f50e3c89388f] for data send at 805306368 (0x30000000).

[PID:02880:013 2016.04.01 04:55:03.133 INFO MasterReplicationSer] Subscribed replica [172.16.0.202:3113,S:d4f418b4-5f91-499b-a7cc-09f0fe28a5cf] for data send at 805306368 (0x30000000).

[PID:02880:013 2016.04.01 04:55:03.133 INFO MasterReplicationSer] Subscribed replica [172.16.0.203:3113,S:db3328d4-6c8a-45c0-8b10-db74451568e0] for data send at 805306368 (0x30000000).

[PID:02880:013 2016.04.01 04:55:03.133 INFO MasterReplicationSer] Subscribed replica [172.16.0.200:3113,S:825e34dc-d662-4ac3-82b9-658782eb3e5d] for data send at 805306368 (0x30000000).


It's caching a new chunk BEFORE uncaching the old one?
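If that overlap is real, a rough sketch of the numbers, using the chunk size and the flags from my command line above (the cache-before-uncache peak is just my reading of the trace, not confirmed behavior):

```python
CHUNK = 256 * 1024 * 1024   # 256 MiB per cached chunk
CACHED = 2                  # --cached-chunks=2
BUDGET = 1073741824         # --chunks-cache-size=1 GiB

steady = CACHED * CHUNK         # 512 MiB resident in steady state
peak = (CACHED + 1) * CHUNK     # 768 MiB if a new chunk is cached
                                # before the old one is released
print(peak <= BUDGET)  # True: the transient peak alone fits the budget
```

So even a three-chunk transient fits inside 1 GiB on paper, which suggests the OutOfMemory comes from overall unmanaged memory pressure on the box rather than the chunk cache accounting by itself.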

This is an example of an OutOfMemory caching failure:

[PID:02880:021 2016.04.01 04:28:24.544 TRACE InMemoryBus ] SLOW BUS MSG [Worker #2 Bus]: TcpSend - 214ms. Handler: TcpSendService.

[PID:02880:021 2016.04.01 04:28:24.560 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [Worker #2]: TcpSend - 230ms. Q: 2/9.

[PID:02880:011 2016.04.01 04:30:50.482 ERROR TFChunk ] CACHING FAILED due to OutOfMemory exception in TFChunk #2-2 (chunk-000002.000000).

[PID:02880:005 2016.04.01 04:30:50.560 TRACE TFChunk ] UNCACHED TFChunk #0-0 (chunk-000000.000000).

[PID:02880:013 2016.04.01 04:30:50.608 INFO MasterReplicationSer] Subscribed replica [172.16.0.204:3113,S:49a8a465-1fc0-450f-8d01-f50e3c89388f] for data send at 536870912 (0x20000000).

[PID:02880:013 2016.04.01 04:30:50.608 INFO MasterReplicationSer] Subscribed replica [172.16.0.202:3113,S:d4f418b4-5f91-499b-a7cc-09f0fe28a5cf] for data send at 536870912 (0x20000000).


Could this be causing the issues we are seeing? Any way to fix?

I am using that CacheSet tool, by the way; I set min/max to 1024 / 2097152 (2 GB) with no change in results.