Event Store eating up almost all available memory

Hello everyone,
We are experiencing a recurring problem where GES (versions 3.0.1 through 3.0.5) takes up a lot of memory for no apparent reason.

It is running in a single node configuration on Ubuntu 14.04.2 LTS using this command line:

./run-node.sh --db /var/lib/eventstore --cached-chunks=0

We added --cached-chunks=0 after seeing the same behavior and searching this group for a solution.

It runs inside a VM with 2 available cores at 2.8 GHz and 6 GB RAM. The underlying hardware is mostly idle, so it is not starving for IO.

We are using GES through its HTTP interface and right now the software is idle (as it is most of the time, it usually has only 2 bursts of activity throughout the day with a few hundred events happening in several minutes; i.e. it doesn’t strain the system nor GES at all).

The log contains the following bits repeatedly:

[23549,10,18:39:10.157] SLOW BUS MSG [MainBus]: PurgeTimedOutRequests - 57ms. Handler: HttpService.

[23549,11,18:39:10.168] SLOW BUS MSG [manager input bus]: RegularTimeout - 63ms. Handler: ProjectionCoreCoordinator.

[23549,10,18:39:10.177] SLOW QUEUE MSG [MainQueue]: PurgeTimedOutRequests - 207ms. Q: 0/5.

[23549,11,18:39:10.177] SLOW QUEUE MSG [Projections Master]: RegularTimeout - 80ms. Q: 0/1.

[23549,70,18:39:34.513] Couldn't get drive name for directory '/var/lib/eventstore' on Unix.

ApplicationName='df', CommandLine='-P /var/lib/eventstore', CurrentDirectory='', Native error= Out of memory

Usually the Linux Out Of Memory killer will kill the Event Store at some point if we leave it in this state for long enough. So it is not an issue with df, as was assumed in another thread where probably the same problem came up but which never reached a conclusion.

In the last snapshot before the oom-killer went to work, GES was hogging 4787M of resident memory; after GES restarted it was using only 292MB of resident memory, which is a bit less than what it usually uses when it works as expected.

Right now we are still evaluating GES, so these irregular crashes are not interfering with production. We really like the features of GES, but with that kind of stability problems we certainly will not be able to use it.

So are we just using (configuring) it wrong? Does GES need more than 6 GB of memory in order to handle at most a few thousand messages per day? Do you need any more information?

Kind Regards,

Paul

How are you measuring memory used? ES uses the disk cache, which depending on how you measure may or may not get counted. If it's too high, configure the cache.

Hello Greg,
thanks for the swift answer. I am aware that ES uses disk cache, which on Linux is usually reported separately.

The memory usage that I posted here is from the output of "htop" in the "RES" column. According to the htop man page it means "The resident set size, i.e the size of the text and data sections, plus stack usage."

Right now ES is again growing with no load. It is already at about 3.2GB according to the "RES" column of htop. "ps aux" reports that it is using 3475972 kilobytes in its "RSS" column ("resident set size, the non-swapped physical memory that a task has used (in kilobytes)") and "pmap -d" reports "mapped: 5264204K writeable/private: 4240736K shared: 4K". Notice that the values are not exactly the same because ES is currently actively growing.
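
For completeness, the numbers above come from roughly the following commands (<pid> stands for whatever process ID the Event Store node has; the /proc line is just an extra cross-check):

ps -o pid,rss,vsz,cmd -p <pid>     # RSS column, in kilobytes
pmap -d <pid> | tail -n 1          # mapped / writeable-private / shared totals
grep '^Vm' /proc/<pid>/status      # VmSize, VmRSS etc. straight from the kernel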

The last entries in the log (those before it are more of the same):

[01449,85,07:31:31.793] SLOW QUEUE MSG [StorageReaderQueue #3]: ReadStreamEventsBackward - 254ms. Q: 0/0.

[01449,85,07:31:32.178] SLOW QUEUE MSG [StorageReaderQueue #4]: ReadStreamEventsBackward - 281ms. Q: 0/0.

[01449,85,07:31:32.602] SLOW QUEUE MSG [StorageReaderQueue #1]: ReadStreamEventsBackward - 300ms. Q: 0/0.

[01449,87,07:31:33.045] SLOW QUEUE MSG [StorageReaderQueue #3]: ReadStreamEventsBackward - 324ms. Q: 0/0.

[01449,87,07:31:33.488] SLOW QUEUE MSG [StorageReaderQueue #4]: ReadStreamEventsBackward - 306ms. Q: 0/0.

[01449,85,07:31:33.959] SLOW QUEUE MSG [StorageReaderQueue #2]: ReadStreamEventsBackward - 362ms. Q: 0/0.

[01449,87,07:31:34.327] SLOW QUEUE MSG [StorageReaderQueue #3]: ReadStreamEventsBackward - 262ms. Q: 0/0.

[01449,87,07:31:34.698] SLOW QUEUE MSG [StorageReaderQueue #1]: ReadStreamEventsBackward - 255ms. Q: 0/0.

[01449,85,07:31:35.028] SLOW QUEUE MSG [StorageReaderQueue #2]: ReadStreamEventsBackward - 225ms. Q: 0/0.

So unless you are using some in-process disk caching instead of relying on the OS for it, it looks like there is a massive memory leak - or that I don’t understand how to configure the cache for ES.

As you can see in the original message we already tried disabling the chunk cache (--cached-chunks=0). We tried that first because I couldn’t get my head around which value for --chunks-cache-size would be appropriate (the documentation doesn’t mention which unit is used for the value, and the default value doesn’t indicate that it is in bytes as it doesn’t divide cleanly by 1024).

Kind Regards,

Paul

How are you running it? Is this the binaries or something else?

We are using the binaries with mono statically linked in (as offered for download from geteventstore.com).

"It runs inside a VM with 2 available cores at 2.8 GHz and 6 GB RAM.
The underlying hardware is mostly idle, so it is not starving for IO."

Obviously others are not having this issue, so I am trying to figure out what might be different. On 14.10 here (and OS X) memory usage seems pretty stable, especially when not doing anything.

"We are using GES through its HTTP interface and right now the
software is idle (as it is most of the time, it usually has only 2
bursts of activity throughout the day with a few hundred events
happening in several minutes; i.e. it doesn't strain the system nor
GES at all)."

It doesn't really do much when not being used. Mostly statistics tracking etc.

btw: you should probably keep the cached chunk count at 1 or 2 if you can (it can make a difference performance-wise). Basically it's just saying to cache that chunk in memory (unmanaged) and is not very expensive.
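
With the command line from the original post, that would look something like this (the value 2 just being the suggested chunk count):

./run-node.sh --db /var/lib/eventstore --cached-chunks=2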

"Usually the Linux Out Of Memory killer will kill the event store
somewhen if we leave this state for long enough. So it is not an issue
with df, as assumed in another thread where probably the same problem
came up but which never saw a conclusion."

It's the OS telling us that df failed because there wasn't enough memory for it to be run.

"So are we just using (configuring) it wrong? Does GES need more
memory than 6 GB in oder to handle several (at most) thousand messages
per day? Do you need any more information?"

Not at all. I am currently soak testing a 5-node cluster with about 1b messages without issue.

After a few hours (OSX)

45569 mono-sgen 9.9 16:40.23 41 0 172 404 551M 551M 0B 0B 690M 3132M 45569 13381 sleeping 501 508449+

45569 mono-sgen 9.5 20:18.10 41 0 180 408 556M 556M 0B 0B 694M 3136M 45569 13381 sleeping 501 587707+

45569 mono-sgen 8.8 37:18.32 41 0 180 405 506M 506M 0B 52M 691M 3133M 45569 13381 sleeping 501 938093+

Still stable. I will try an Ubuntu machine as well in the cloud later this evening.

Can you tell us more about your environment/use?

That’s actually the problem: it runs fine most of the time, and then out of the blue its memory usage just blows up. For the last day it has been running all right, and before the last incident it worked for 3 weeks without a hitch; then it blew up three times on the same day. All of that without any change in load or access patterns.

https://groups.google.com/forum/#!searchin/event-store/out$20of$20memory/event-store/nZ3tGVl30FI/bK5U1-Pd9V0J

Here it seems to be a Windows caching thing, but Gabriel never actually said that their problem had really gone away; they just threw even more memory at it to control the problem. He also never detailed anything about the actual memory consumption of the process. I don’t know enough about disk caching on Windows, but on Linux the kernel will free memory that is used for disk caching if there are applications that need it.

https://groups.google.com/forum/#!searchin/event-store/df/event-store/oMpk7lbDiiU/gzjzSHyaUS0J

This thread is somewhat cut short, and I can’t find the thread it originally branched off from. But df running into an out of memory error is definitely associated with something eating up all available memory.

So it seems I am not completely alone with the problem.

Let me add some more system details, as something I haven’t mentioned yet may be the culprit:

Kernel: Linux 3.13.0-63-generic

Architecture: x86_64

Swap: Disabled (<-- the reason why the out of memory killer actually goes to work instead of the server stalling)

Is there anything that I can do to get more information that I can share the next time ES starts eating our memory?

"Here it seems to be a windows caching thing, but Gabriel never
actually said that their problem had really gone away, they just threw
even more memory at it to control the problem. He also never detailed
anything about the actual memory consumption of the process. I don't
know enough about disk caching in windows, but on linux the kernel
will free memory that is used for disk caching if there are
applications that need it."

This is Windows disk caching. It is unbounded by default and often doesn't release fast enough; limiting the cache resolves this.

"Is there anything that I can do do get any more information that I
can share for the next time ES starts eating our memory?"

Process information. A memory dump would also be useful.
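
On Linux, something along these lines should capture that the next time it happens (<pid> stands for the Event Store process ID; gcore ships with the gdb package and the resulting core file will be roughly the size of the resident set):

pmap -x <pid> > pmap.txt              # per-mapping breakdown of resident memory
cat /proc/<pid>/smaps > smaps.txt     # the same, in more detail
grep '^Vm' /proc/<pid>/status         # summary counters (VmSize, VmRSS, ...)
sudo gcore -o eventstore-core <pid>   # full dump, written as eventstore-core.<pid>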

Can you tell us more about your environment/use?

Right now it is used in a simple application that we built to get our heads around CQRS and event sourcing. It is mostly an overengineered batch job. Twice per day it imports some files, validates them and hands them over for further processing to another service, which is polled regularly for updates on the processing state. The event streams (one per aggregate) go into ES and the derived read model goes into PostgreSQL. With that we have just several hundred events per day.

The application is written in Java (8), but instead of using the official JVM client we decided to use the HTTP interface. We write our events as JSON using the 'Content-Type: application/vnd.eventstore.events+json' header and retrieve them using 'Accept: application/json' and the 'embed=tryHarder' URL parameter. We do not use the projection features of ES. The application itself runs with a limit of 512MB that the JVM may allocate.
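
To make that concrete, the calls look roughly like this (host, stream name and payload are made up for illustration; 2113 is just the default HTTP port):

curl -i -X POST 'http://127.0.0.1:2113/streams/import-batch-42' \
  -H 'Content-Type: application/vnd.eventstore.events+json' \
  -d '[{"eventId":"fbf4a1a1-b4a3-4dfe-a01f-ec52c34e16e4","eventType":"FileImported","data":{"fileName":"example.csv"}}]'

curl -i 'http://127.0.0.1:2113/streams/import-batch-42?embed=tryHarder' \
  -H 'Accept: application/json'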

Our events are mostly quite small, only the status update events get bigger (about 100kb), but we have only about 4 of them per day.

Process information. A memory dump would also be useful.

The memory dump would probably be several GB in size and contain some sensitive data, but I’ll try to get one.

What process information would you like?

Anything you can get regarding image size etc.

We can give you an S3 bucket to upload into if that’s helpful, mail me off list to arrange this.

With the rates you’re talking about it’s either a serious bug or a configuration issue - we regularly run stress tests over days without seeing this. Are you running locally or in a cloud?

We can give you an S3 bucket to upload into if that’s helpful, mail me off list to arrange this.

When it happens again I can provide a link to the dump off list.

With the rates you’re talking about it’s either a serious bug or a configuration issue - we regularly run stress tests over days without seeing this. Are you running locally or in a cloud?

It is running locally in a VM on our own infrastructure (vSphere). I assume it to be a bug that we manage to trigger for some reason. Maybe it is because there is no real stress, but since the last restart it has been working as expected again (i.e. using only about 200 MB of memory, as it still runs with the chunk cache disabled).

Are you running with projections enabled? If so, which ones are running? Also, can you try upgrading to 3.2.0? There are a number of changes there (see the release notes). You can acquire it on Ubuntu via the repository:


curl https://apt-oss.geteventstore.com/eventstore.key | sudo apt-key add -
echo "deb [arch=amd64] https://apt-oss.geteventstore.com/ubuntu/ trusty main" | sudo tee /etc/apt/sources.list.d/eventstore.list
sudo apt-get update
sudo apt-get install eventstore-oss

Never tested on vSphere, but I don’t imagine it should behave that differently to KVM or the like.

We are running with projections enabled, but only the default projections are running. We haven’t added any custom ones.

I’ve now upgraded to 3.2.0 and really hope that the bug that triggers the problem was quietly fixed at some point since 3.0.5.

As I said earlier, this behavior only happens intermittently, sometimes going weeks before it appears (and then coming up several times on the same day). When I see it the next time I’ll try to get a memory dump, but let’s hope that whatever has been causing this is already fixed in the new version.

Is there any possibility your code was issuing unbounded reads? There are new protections in place against that (from 3.2.0).

I’m not sure, as I don’t know how to issue an unbounded read or how it differs from normal reads. We usually just read from the end of the stream back to the beginning, always adding the 'embed=tryHarder' GET parameter.
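
If it helps, the reads look roughly like this (the stream name is made up and 20 is just an example page size); we then follow the "next" link of the returned feed until no such link is left:

curl -i 'http://127.0.0.1:2113/streams/import-batch-42/head/backward/20?embed=tryHarder' \
  -H 'Accept: application/json'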