Event Store eating up almost all available memory

jen20 · September 9, 2015, 2:11pm

Are you following the links or constructing the URIs?

Paul_Dubs · September 10, 2015, 8:36am

We are following the links but add the embed parameter to them (through a URL builder, not string concatenation).
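For anyone reading along, this is roughly what "add the embed parameter through a URL builder" can look like. It is only a sketch: the host, port, stream name and the `with_embed` helper are made up for illustration and are not taken from our code, and `body` is just one of the values the `embed` parameter accepts.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_embed(link: str, mode: str = "body") -> str:
    """Return `link` with its `embed` query parameter set to `mode`.

    `with_embed` is a hypothetical helper; the link itself is expected to
    come from a feed's link relation rather than being built by hand.
    """
    parts = urlsplit(link)
    # Keep every existing query parameter except a previous `embed` value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "embed"
    ]
    query.append(("embed", mode))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical link, standing in for one followed from a feed.
print(with_embed("http://127.0.0.1:2113/streams/example-stream/0/forward/20"))
```

The point of going through `urlsplit`/`urlunsplit` is that a link which already carries a query string stays valid, which plain concatenation of `?embed=body` would not guarantee.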

Paul_Dubs · October 3, 2015, 9:06pm

After a relatively long time without problems, it just happened again. The installed version is now 3.2.1, and I managed to get a core dump (4.2 GB in size) before the OOM Killer got to it.

So, how can I provide it to you? As it contains some sensitive data, I can’t just post it publicly.

jen20 · October 3, 2015, 10:05pm

If you share it via DropBox or similar, you can email the URL to me, Greg or Pieter (first names @ geteventstore.com respectively).

Thanks,

James

Paul_Dubs · October 4, 2015, 7:32am

As it happened again overnight, I have created a second core dump and collected some more information via pmap.

I’m compressing it for upload as I write this.

Interestingly, the (first) memory image is very compressible. Starting at over 4GB, it compressed down to just 84MB. So it seems most of that allocated memory is either empty or filled with the same repeating pattern of data.
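A ratio of roughly 50:1 is what one would expect if most pages in the image are zero-filled. Below is a small sketch of how that could be checked. It runs on a synthetic buffer rather than the real dump, the 4096-byte page size is an assumption, and for a multi-gigabyte file the data would have to be read in chunks instead of being held in memory.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size; adjust for the platform in question


def zero_page_fraction(data: bytes, page_size: int = PAGE_SIZE) -> float:
    """Fraction of pages in `data` that consist only of zero bytes."""
    if not data:
        return 0.0
    total = 0
    zero_pages = 0
    for offset in range(0, len(data), page_size):
        page = data[offset:offset + page_size]
        total += 1
        # A page counts as empty when it equals a zero buffer of its length.
        if page == bytes(len(page)):
            zero_pages += 1
    return zero_pages / total


def compression_ratio(data: bytes) -> float:
    """Uncompressed size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data))


# Synthetic stand-in for a dump: 99 empty pages plus one page of real data.
sample = bytes(PAGE_SIZE * 99) + bytes(range(256)) * 16
print(f"zero pages: {zero_page_fraction(sample):.0%}")
print(f"compression ratio: {compression_ratio(sample):.0f}:1")
```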

Paul_Dubs · October 4, 2015, 7:19pm

Another interesting data point: the problem seems to cluster around the beginning of a month, and when it happens it tends to happen several times per day. Once the problem has gone away, it stays away for some time.

Greg_Young1 · October 4, 2015, 7:58pm

Is there anything in your workload changing at the beginning of the
month? Or in server maintenance etc?

I am also curious why your reads are taking so long as in :

[01449,87,07:31:33.045] SLOW QUEUE MSG [StorageReaderQueue #3]:
ReadStreamEventsBackward - 324ms. Q: 0/0.

Are you seeing the same now as before? This is a VERY long time for
any reasonable read to complete
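To see how often these entries occur and how slow they are, the log can be scanned for them. The following is a rough sketch built only from the one entry quoted above: the pattern may not cover every variant of the message, and the field names are guesses (the first two numbers are assumed to be process and thread ids, and the two numbers after `Q:` are simply labelled first and second).

```python
import re

# Pattern derived from the single quoted log entry; field names are assumptions.
SLOW_MSG = re.compile(
    r"\[(?P<pid>\d+),(?P<thread>\d+),(?P<time>[\d:.]+)\]\s+"
    r"SLOW QUEUE MSG \[(?P<queue>[^\]]+)\]:\s+"
    r"(?P<message>\w+) - (?P<ms>\d+)ms\. "
    r"Q: (?P<q_first>\d+)/(?P<q_second>\d+)\."
)


def parse_slow_messages(text: str) -> list[dict]:
    """Extract SLOW QUEUE MSG entries, tolerating entries wrapped over lines."""
    normalized = " ".join(text.split())  # collapse line breaks and extra spaces
    entries = []
    for match in SLOW_MSG.finditer(normalized):
        entry = match.groupdict()
        entry["ms"] = int(entry["ms"])
        entries.append(entry)
    return entries


log_excerpt = """[01449,87,07:31:33.045] SLOW QUEUE MSG [StorageReaderQueue #3]:
ReadStreamEventsBackward - 324ms. Q: 0/0."""

for entry in parse_slow_messages(log_excerpt):
    print(entry["time"], entry["queue"], entry["message"], entry["ms"], "ms")
```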

Paul_Dubs · October 5, 2015, 11:35am

> Is there anything in your workload changing at the beginning of the
> month? Or in server maintenance etc?

Nothing that runs on the same storage. The only thing that has real usage on it is the application that we use along with ES.

> I am also curious why your reads are taking so long as in :
>
> [01449,87,07:31:33.045] SLOW QUEUE MSG [StorageReaderQueue #3]:
> ReadStreamEventsBackward - 324ms. Q: 0/0.
>
> Are you seeing the same now as before? This is a VERY long time for
> any reasonable read to complete

I can’t figure out why it is sometimes so slow. Even the highest peaks that we see in our logging of access times are at least two orders of magnitude lower.

Greg_Young1 · October 5, 2015, 12:19pm

My guess is you are only doing a read or two/second as well so this
should not be gc etc related as it seems to be every read is going
this speed. Or perhaps you are doing lots of reads and I just
misunderstood the workload? About what is the size of your events and
what reads are you doing?

Paul_Dubs · October 6, 2015, 10:34am

> My guess is you are only doing a read or two/second as well so this
> should not be gc etc related as it seems to be every read is going
> this speed. Or perhaps you are doing lots of reads and I just
> misunderstood the workload? About what is the size of your events and
> what reads are you doing?

Actually we are not using ES all that much. Queries only touch a read model that is persisted in a postgres database (which happily runs with its default configuration, currently using 15MB of memory). Every half an hour a command comes along (executed via a cronjob), that checks the status of a process running within another service. It will store the result as an event (this result is the biggest event, and is about 100K of JSON) and depending on the status, it will also cause some more commands to be run (up to 250) which in turn will result in another event per command (about 200 bytes). And then it will be silent for another 30 minutes.

The only other commands come in twice per day, and they result in several hundred events (at most about 300, each about 200 bytes).

We use an event stream per aggregate. The small aggregates usually have only 3 or 4 events (each about 200 bytes). The larger ones usually have fewer than 4 but can have up to about 150 (each about 100K) if the work for the other service is not done before the weekend.
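To put those numbers in perspective, here is a back-of-the-envelope upper bound on what gets written per day. It takes the maximums mentioned above and reads "100K" as 100,000 bytes, which is an assumption on my part; the real volume will usually be lower.

```python
# All figures are the rough upper bounds from the description above.
RUNS_PER_DAY = 24 * 2            # the cronjob fires every half hour
STATUS_EVENT_BYTES = 100_000     # "about 100K of JSON", read as 100,000 bytes
SMALL_EVENT_BYTES = 200
MAX_FOLLOW_UP_EVENTS = 250       # at most 250 follow-up commands per run
BATCHES_PER_DAY = 2              # the other commands arrive twice per day
MAX_EVENTS_PER_BATCH = 300
MAX_LARGE_STREAM_EVENTS = 150    # worst case for one large aggregate

cron_bytes = RUNS_PER_DAY * (
    STATUS_EVENT_BYTES + MAX_FOLLOW_UP_EVENTS * SMALL_EVENT_BYTES
)
batch_bytes = BATCHES_PER_DAY * MAX_EVENTS_PER_BATCH * SMALL_EVENT_BYTES
daily_bytes = cron_bytes + batch_bytes
largest_stream_bytes = MAX_LARGE_STREAM_EVENTS * STATUS_EVENT_BYTES

print(f"written per day (upper bound): {daily_bytes / 1e6:.2f} MB")
print(f"largest stream (upper bound):  {largest_stream_bytes / 1e6:.0f} MB")
```

Even in the worst case that is a few megabytes of writes per day and a largest stream in the low tens of megabytes, which is tiny compared to the gigabytes of memory being allocated.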

Greg_Young1 · October 6, 2015, 11:14am

In your log you are getting a read at least once per second

Paul_Dubs · October 6, 2015, 11:28am

> In your log you are getting a read at least once per second

That is during the time that ES eats all the memory, even when the application that uses it is not running. ES on the other hand also uses 100% CPU.

Greg_Young1 · October 6, 2015, 11:46am

Yes so what is doing the reads?

Are you running projections etc?

Paul_Dubs · October 6, 2015, 11:53am

> Yes so what is doing the reads?
>
> Are you running projections etc?

Projections are enabled, even though we haven’t defined any custom ones.

Paul_Dubs · October 16, 2015, 7:35am

Now it happened again and the log once again contained “SLOW QUEUE MSG” entries. So maybe that is what is triggering the strange behavior. But still, ES shouldn’t go crazy and start allocating all available memory.

Greg_Young1 · October 16, 2015, 9:07am

Log? I am curious to see if you are also getting the constant load
without having any load

Paul_Dubs · October 16, 2015, 9:41am

> Log?

The log ES writes to /var/log/eventstore/.

> I am curious to see if you are also getting the constant load
> without having any load

It seems like it, and as I said, projections are enabled even though we haven’t defined any custom ones. They are enabled so I can take a look into the $by_event_type projection manually.

Greg_Young1 · October 16, 2015, 9:44am

You are the only one we know of experiencing this issue so I am trying
to figure out what might be different about your workloads.

Paul_Dubs · October 16, 2015, 9:54am

I’d be glad to help, but besides the memory dump, for which I sent the link to James on October 4, I don’t really know what else I can do.

Brenton_Annan · April 25, 2016, 10:39pm

Hey Paul, did you find out what was causing this issue for you? I’ve just had a long-running (about 2 months since last time I logged onto it) staging server with pretty much zero utilisation crash from OOM. If you didn’t manage to resolve it, have you continued to have the same problem?