Firstly the TL;DR of the below detail is that we’re seeing really poor event store performance when writing new events “as quickly as possibly”. There is a load more information I can provide on request or publish somewhere else and link to (e.g. logs, segfault/crash info, CPU/disk stats from the host, graphs of performance metrics, etc).
- We are using eventstore to track all changes in “Products” so that we have a full audit trail/history of every Product in the system.
- The current state of a Product is obtained at regular intervals (typically daily, but can equally be weekly or hourly).
- The “commands” that represent these current Product states arrive in a message queue that our “Product Command Processor” handles
- For each command the command processor will
- Read all existing events in eventstore for the Product in question to reconstruct the aggregate
- (If there are no events for this Product, emit a ProductCreated event with containing the full Product detail)
- Compare the aggregate state to the state in the incoming message and emit the new events describing the state change
- We have the by_category system projection enabled so that we can have projections that subscribe to $ce-product in order to stay up to date with all Product events
- We get approximately 12 million product updates per day
- There are approximately 50 million products in existence (this grows steadily as new ones are found)
- We write approximately 35 million events (25 GB of event data) per day based on the detected Product changes
- We running on a single node linux VM EC2 instance in AWS and have tried a few different flavours:
- c5n.4xlarge with “vanilla” SSD EBS drive
- c5n.4xlarge with “high iops” SSD EBS drive
- i3en.2xlarge with dedicated local SSD for the event data location (disk iops is even higher - currently running this one but still seeing the same issues)
- We run a single instance of Event Store (linux version) on this single EC2
- We sometimes run multiple instance of the command processor (in an effort to squeeze more performance from ES)
- In general we find things run smoothly when we start out and everything is keeping up with the incoming data stream nicely
- We run into issues when we try and populate eventstore more quickly:
- by running through the last 24 hours of commands which we have in a kafka topic
- or, if we have had some sort of error or redeployment of the command processor meaning there is a backlog of commands to process
- Our persistent subscriptions use the default settings and use the $ce-product stream
- When our command processor is trying to catch up with a backlog, i.e. trying to go as fast as possible (trying to write new events, but also doing reads to reconstruct aggregates as part of that process)
Persistent subscriptions grind to a halt
- On the UI, the Persistent Subscriptions page is empty.
- On the dashboard there is a large build up of work in the PersistentSubscriptions queue (other queues are fine). Observed that even if we stop all our external processes (no requests in or out of ES) this queue is processed at a pitiful rate (e.g. <100 item/s)
- The system projection starts to fall behind (note that even with this turned off temporarily, we generally don’t see an improvement in event writing performance)
- Consumers of those subscriptions (our own projections) initially get starved (maybe < 10 events per second) and soon after the subscription just dies and those consumers can’t reconnect
- UI becomes unresponsive and/or get regular “an error has occurred” banners
- We can’t monitor what’s happening via grafana as the prometheus request to scrape the metrics from the eventstore endpoint times out with this message : time=“2019-10-22T15:37:04Z” level=error msg=“Error while getting data from EventStore” error=“HTTP call to http://172.xx.xx.xx:2113/subscriptions resulted in status code 408”
- Recovery happens when the command processor rate slows down to something more pedestrian, once the system projection and the PersistentSubcriptions queue catches up, the subscriptions re-appear in the UI and the consumers can connect again
- Persistent subscriptions grind to a halt
- We’re trying/hoping to process Product updates at a few thousand messages per second at least (to catch up with a backlog) rather than the low hundreds which is what we tend to see (this is only just about enough to keep up with live data).
- We regularly see “SLOW QUEUE MSG” and “SLOW BUS MSG” in the event store logs. When things are really bad we also get “VERY SLOW QUEUE MSG” in the error logs. We don’t truly know what this is really telling us and how to react, though obviously it is somewhat of an indicator that we are putting eventstore under stress.
- We observe that the eventstore machine doesn’t seem to make full use of the resources available to it.
- Typically when we see poor performance we might see 25-50% CPU utilization and low disk iops (5k reads/writes per second) compared to what the instances should be capable of.
- … but we have seen much higher figures when things are going smoothly (CPU around 60% and up to 150,000 disk iops)
- Generally not getting great performance when reading a system category projection such as $ce-product, as compared to a raw stream of events. Suppose this is expected to some extent as the former has to resolve the linked events?