Performance Tuning Guide/Resources for Persistent Subscriptions

Jon_Doherty · April 8, 2019, 11:11pm

Hello Everyone,

Are there any resources available for suggestions on tuning the performance of EventStore when using persistent subscriptions?

Recently, we consumed a very large data set that created over a million events. As our subscribers started processing these events, we saw the EventStore instance begin to slow down. Eventually, we could no longer view the persistent subscription page, and many messages began to time out and get parked. Our assumption is that the workers’ Buffer Size was too high, combined with too many workers and a Message Timeout that was too low. We also tried tweaking some of the other parameters (Live Buffer Size, Buffer Size, and Read Batch Size), but they didn’t seem to have much effect.

It would be helpful if there was a guide of sorts that talked about the different parameters available, and an approach to tweaking them to find a good balance. Does something like that exist?

Thanks,

Jon

Greg_Young1 · April 8, 2019, 11:25pm

What were your settings? It's hard to say what is too high or too low without knowing the operating environment.

Jon_Doherty · April 8, 2019, 11:41pm

For the workers, each had a buffer size of 50. We had 10 workers attached to the stream.
For the Group, we had:

Message Timeout - 60000
Live Buffer Size - 500
Buffer Size - 500
Read Batch Size - 20
Checkpoint After - 2000
Min Checkpoint Count - 10
Max Checkpoint Count - 1000
Consumer Strategy - Round Robin
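
As a rough sanity check on these numbers, the sketch below estimates how long the last message in a full worker buffer waits before it is finished, and compares that with the Message Timeout. It assumes each worker handles its buffered messages one at a time and that the timeout clock starts when a message is pushed to the worker; both are assumptions, not confirmed behaviour of this setup. The per-message times are hypothetical.

```python
def worst_case_wait_ms(client_buffer_size: int, per_message_ms: float) -> float:
    """Time until the last message in a full client buffer is finished,
    assuming the worker processes its buffered messages serially."""
    return client_buffer_size * per_message_ms


def at_risk_of_timeout(client_buffer_size: int, per_message_ms: float,
                       message_timeout_ms: float) -> bool:
    """True when the last buffered message could exceed the message timeout."""
    return worst_case_wait_ms(client_buffer_size, per_message_ms) >= message_timeout_ms


# Values taken from the settings listed above.
WORKERS = 10
CLIENT_BUFFER = 50
MESSAGE_TIMEOUT_MS = 60_000

# Messages that can be outstanding across all workers at once.
in_flight = WORKERS * CLIENT_BUFFER

# Hypothetical per-message processing times, in milliseconds.
for per_message_ms in (200, 1000, 1500):
    wait = worst_case_wait_ms(CLIENT_BUFFER, per_message_ms)
    risky = at_risk_of_timeout(CLIENT_BUFFER, per_message_ms, MESSAGE_TIMEOUT_MS)
    print(f"{per_message_ms} ms/msg -> last message done after {wait:.0f} ms, "
          f"timeout risk: {risky}")
```

Under these assumptions, the margin shrinks quickly once the average handling time approaches one second, so a smaller worker buffer or a longer timeout would be the first things to try.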

Greg_Young1 · April 9, 2019, 12:14am

Is there anything unusual about these messages, such as being 3 MB each? Are you running in an IOPS-limited environment? At first glance this seems like a reasonable setup. How long does processing generally take per message?

Jon_Doherty · April 9, 2019, 2:58pm

The messages are fairly small, only a few KB each. The stream itself is a $ce- stream created by the $by_category projection, with roughly 620k streams behind it.

To our knowledge, we are not IOPS limited. We can manage the infrastructure (Kubernetes) but do not have full control over it. We are working to get a more definitive answer on this.

Is there a good way to determine the run time of a given message, or an average run time across messages? At this point we are speculating that many of those messages may take upwards of a second because of a non-optimized MongoDB query. Not every message was taking that long, but a large number were.
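
One client-side way to get these numbers, sketched here without any Event Store API, is to wrap the subscriber's handler and record how long each call takes. `HandlerTimer` and the handler shape (a callable taking one event) are hypothetical names for illustration; the real handler signature depends on the client library in use.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


class HandlerTimer:
    """Collects wall-clock durations of handler calls, in milliseconds."""

    def __init__(self) -> None:
        self.durations_ms: List[float] = []

    def timed(self, handler: Callable[[object], None]) -> Callable[[object], None]:
        """Return a wrapper that times each call, including calls that raise."""
        def wrapper(event: object) -> None:
            start = time.perf_counter()
            try:
                handler(event)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                self.durations_ms.append(elapsed_ms)
        return wrapper

    def summary(self) -> Dict[str, float]:
        """Count, mean, nearest-rank 95th percentile, and max of recorded times."""
        if not self.durations_ms:
            return {"count": 0}
        ordered = sorted(self.durations_ms)
        p95_index = math.ceil(0.95 * len(ordered)) - 1
        return {
            "count": len(ordered),
            "mean_ms": statistics.mean(ordered),
            "p95_ms": ordered[p95_index],
            "max_ms": ordered[-1],
        }


# Usage sketch: wrap the real handler once, then log summary() periodically.
timer = HandlerTimer()
handle = timer.timed(lambda event: None)  # placeholder for the real handler
for event in range(3):
    handle(event)
print(timer.summary())
```

Comparing the 95th percentile and the maximum against the Message Timeout gives a better picture than the average alone, since a minority of slow messages is enough to cause parking.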

James_Connor · April 9, 2019, 3:10pm

You might want to look at how quickly the $ce- event is emitted compared with when the original event hits your stream.

We have developed our own metrics and alerting for this stat, as we have been caught out by it in the past (one cause was a hard-to-diagnose memory starvation brought on by unusual OS disk caching).
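
A minimal sketch of such a lag check follows. It assumes two timestamps are available for each event: when the original event was written and when its $ce- link was emitted or observed. How to obtain them depends on the client, and the function names and the five-second threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def projection_lag(original_created: datetime, link_created: datetime) -> timedelta:
    """Delay between the original event and its link in the category stream."""
    return link_created - original_created


def lag_exceeds(original_created: datetime, link_created: datetime,
                threshold: timedelta) -> bool:
    """True when the category projection is further behind than the threshold."""
    return projection_lag(original_created, link_created) > threshold


# Hypothetical timestamps and alert threshold, for illustration only.
original = datetime(2019, 4, 9, 15, 0, 0, tzinfo=timezone.utc)
link = datetime(2019, 4, 9, 15, 0, 12, tzinfo=timezone.utc)
ALERT_THRESHOLD = timedelta(seconds=5)

print(projection_lag(original, link).total_seconds())
print(lag_exceeds(original, link, ALERT_THRESHOLD))
```

Tracking this value over time separates a slow projection from slow subscribers, which would otherwise look the same from the consumer side.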


Jon_Doherty · April 15, 2019, 7:59pm

We got an answer back on the IOPS limiting: there are currently no limits in place.

I am still unclear as to where to collect stats for Greg’s question (run time per message), and how to approach the performance tuning of our setup in general.

James: We can look into whether we can report on the scenario you described. How did you go about diagnosing the OS disk caching issue, and how did you fix it?