Production down due to Out Of Memory

We have a few clients whose servers completely stalled because GetEventStore (GES) ate up all their memory. We can reproduce the issue in-house with both version 1.0.1 and 2.0.1 of GES; they show exactly the same behavior.

If we have a data set with 13 chunks (256 MB each), GES immediately allocates about 13 * 256 MB ≈ 3.3 GB of RAM while starting up. The memory is NOT released afterwards; it is only released as soon as we stop GES. See the attached Task Manager screenshot.

To run the tests, we start GES from the command line with the following parameters:

EventStore.SingleNode.exe --db c:\path-to-db --ip=some-ip-address

Even when adding -c 0 to the command line parameters, i.e.

EventStore.SingleNode.exe --db c:\path-to-db --ip=some-ip-address -c 0

the situation does not change significantly. GES uses maybe 200 MB less.

Some of our customers already have 20+ chunks, and the numbers are growing fast.

The environment we usually use is a dedicated VM with 8 GB of RAM. In the above cases we could mitigate the situation by adding another 8 GB of RAM, but customers are concerned.

I feel like we are doing something fundamentally wrong, but looking at the documentation (specifically the command line parameters) we do not see anything obvious. Any help is appreciated.

Capture.PNG

I should add that I just built GES from source (dev branch) and the behavior is exactly the same.

Gabriel,

The screenshot you show probably includes the file cache size, which is typical when you run Event Store. Windows does not normally release file cache memory unless it experiences memory pressure. This is only my guess, however.

Do you use projections?

Could you inspect .NET related and memory related performance counters for event store process?
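For example, something like this (the process instance name here is only a guess; use whatever instance name perfmon or Task Manager shows for the single node process):

rem instance name "EventStore.SingleNode" is an assumption, adjust to your setup; -si 5 samples every 5 seconds
typeperf "\Process(EventStore.SingleNode)\Working Set" "\Process(EventStore.SingleNode)\Private Bytes" "\.NET CLR Memory(EventStore.SingleNode)\# Bytes in all Heaps" "\Memory\Cache Bytes" -si 5

Private Bytes is what the process itself has allocated; Cache Bytes shows the system file cache working set.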

Do you see anything interesting in log files?

What is the average size of your events?

What happens when your server goes down from running out of memory? Does the process crash?

Even 100 chunks should not be a problem on 8 GB of RAM.

-yuriy

GES is not allocating this memory; the file system is. This is completely normal, as the file system is caching files. If you put it under pressure, the file system gives up its caching. If you look at process memory, GES is probably using about 800 MB with the default configuration.

Btw, in a newer branch we are also supporting memory-mapped files, but the same type of behaviour can still happen, as the OS will be swapping out the pages.

Btw, did production actually go down as the subject says, or are you just worried that it may?

Also, it would be fairly trivial for me to add an argument to make sure file system caching is disabled. This will, however, affect performance.

The servers went down and we had to increase memory to 16 GB.
Under memory pressure the Event Store process keeps running but pages excessively, and other processes like w3wp report insufficient memory to process requests.

We will monitor the suggested counters tomorrow during the day.

BTW, shouldn't we be able to set an upper limit on the memory consumption of GES, similar to an RDBMS? IIRC Ayende also does this for RavenDB. MongoDB, which we use in production, is also very modest in its memory consumption.

We do not run any projections on GES. We have loads of different events and their size varies from tiny to maybe 1 kB at the large end… but I do not really have any exact numbers there.

If you want to avoid the initial use of memory, set the "don't verify chunks" option at startup. By default we open all chunks at startup and validate their checksums, which also primes them into the file cache. As far as limiting things goes, I believe you can do this in the Windows file cache itself; what we are doing is just allowing the OS to make such decisions.

Skip verify:

--do-not-verify-db-hashes-on-startup, --skip-db-verify
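So for the test you posted above that would be, e.g.:

EventStore.SingleNode.exe --db c:\path-to-db --ip=some-ip-address --skip-db-verify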

Maybe it makes sense to allow explicitly telling the OS not to cache things; in your situation this would prevent the chunks from ever going into the cache. I'm surprised, however, that your file system cache doesn't give up memory under pressure. Here are some config points for it:

https://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/fsutil.mspx?mfr=true

The behaviour for memory usage can be set there.
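For example (from an elevated command prompt; I'm going from memory on what the levels mean, the page above has the details):

rem show the current NTFS memory usage setting
fsutil behavior query memoryusage
rem 1 is the default level, 2 lets NTFS use more memory for caching metadata
fsutil behavior set memoryusage 1

I believe a reboot is needed for the set to take effect.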

Cheers,

Greg

Thanks. We will test all this carefully and report back on the outcome.

It would be useful if we covered some of these admin-type things in an administration FAQ. In principle it is not a bad thing (Windows file caching does this all the time with services), however it can be scary to see the overall memory usage jump.

Greg

Gabriel,

are you accessing Event Store via the native TCP interface (Client API) or via HTTP?

-yuriy

@gabriel

Can you set up the machine with two disks instead of one? Put the ES data on its own disk and disable caching; after that you should see no memory usage going up. Does this make sense?
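i.e. something along the lines of (d:\es-data is just an example path on the dedicated data disk):

rem point the database at the dedicated disk; d:\es-data is a placeholder
EventStore.SingleNode.exe --db d:\es-data --ip=some-ip-address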

Cheers,

Greg

@Yuriy: We're accessing GES via the TCP interface only.
@Greg: We'll try all the suggestions. (By the way, the documentation about the chunk cache and cached chunk size is confusing and/or contains mistakes: https://github.com/eventstore/eventstore/wiki/Command-Line-Arguments. What is the difference between the two?)

Do you think this (limiting the cache size)

http://technet.microsoft.com/en-us/sysinternals/bb897561

could help us to get things under control?

The difference is what we put in memory vs what the file system puts into memory.
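Roughly speaking, -c controls how many chunks GES itself keeps cached inside its own process memory; something like:

rem assuming -c is the cached-chunks count used earlier in the thread, this keeps up to 2 chunks (~512 MB) in the GES process
EventStore.SingleNode.exe --db c:\path-to-db --ip=some-ip-address -c 2

Anything beyond that is down to the Windows file cache, which is outside GES's control.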

I have not tested this on later versions. However, a simple test was provided; if it works, focus on configuration.

We are experiencing this issue as well. I didn't notice a follow-up from Gabriel here. What was the solution?

We had to use CacheSet from Sysinternals to set a threshold on boot to control the problem.

Are there other solutions, perhaps settings in Event Store available to take care of this?

No, it’s the Windows file cache causing the issue here, so there’s nothing we can do about it short of using unbuffered IO. The sysinternals tool is about the best you can do.