Questions on EventStore performance and threading architecture

I’ve recently been doing some performance tuning and investigation on EventStore to deal with reliability issues (primarily connections being reset from heartbeat timeouts, and intermittent high latency). I would very much appreciate replies to a few questions below to better understand the factors going into this.

For context, we’re running EventStore 4.1.4 in a 3-node cluster in Docker under Kubernetes with Google Cloud Persistent Disks (network storage). While I am aware that network storage is not the recommended setup, it does make our lives easier when it comes to redundancy/backups, and for the most part it has been good enough.

To smooth out the latency, we’ve tested various options for the cache sizes and thread settings, initially increasing both the reader thread count and the worker thread count. However, increasing the number of worker threads seems, if anything, to have a negative impact on average latency and jitter. Can you confirm that worker threads are purely CPU bound, so that there is no benefit to having more worker threads than cores?

Looking at the number of reader threads, we don’t yet have conclusive results, but from the source code it does appear that the storage workers do blocking disk IO (which for us means network IO), and that as a result there would be a benefit to adding more reader threads to reduce the impact of occasional spikes in storage access latency. Are there any other points of contention (e.g. in accessing the underlying caches/files) that would lead to diminishing returns from adding more reader threads?

Additionally, I’ve noticed that while the config options for the above items talk about reader and worker threads, they’re not strictly allocated threads, but instead thread pool workers accessed on demand for as long as the queue is busy. Given the small number of threads, is there any reason why readers and workers are not implemented as dedicated threads? While the impact may be small, this would presumably remove any contention on the thread pool itself, and would mean that the workers could be blocked on a single lock when idle, rather than a new thread being claimed each time they start.
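To make clear what I mean by a dedicated thread, here is a minimal Python sketch. It is not EventStore code and says nothing about how the real queues are written; `DedicatedWorker`, `publish` and `stop` are hypothetical names I made up for illustration. The point is only that the thread is created once and, when idle, blocks on its own queue rather than being returned to and re-claimed from a shared pool.

```python
import queue
import threading

# Sentinel object used to ask the worker to shut down (illustrative only).
_STOP = object()


class DedicatedWorker:
    """One long-lived thread that owns one queue and blocks on it when idle."""

    def __init__(self, handler):
        self._handler = handler
        self._inbox = queue.Queue()
        # The thread is created once, up front, and lives until stop().
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def publish(self, message):
        # Enqueue a message; this wakes the worker if it is waiting.
        self._inbox.put(message)

    def _run(self):
        while True:
            # Blocks here while the queue is empty, so an idle worker
            # waits on a single synchronisation point.
            message = self._inbox.get()
            if message is _STOP:
                break
            self._handler(message)

    def stop(self):
        # Messages published before stop() are handled first (FIFO).
        self._inbox.put(_STOP)
        self._thread.join()
```

Usage would look like `w = DedicatedWorker(handle); w.publish(msg); ...; w.stop()`, where `handle` is whatever function processes a single message.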

Finally, I’ve noticed that messages are dispatched to the queues internally using a simple round robin scheme (when “better ordering” / queue affinity is not used). While this will be fine assuming that the latency for processing individual work items is small and averages out quickly, it does mean that a single slow item can build up a backlog of unprocessed events behind it, even if other queues are empty. Is there a particular reason (other than simplicity) why this is not implemented in a way that favours queues with fewer items in them, or uses something like work stealing, where idle workers attempt to take work off a backed-up queue? As it stands, it seems like the only way of reducing the impact of individual slow items is to add more readers, which would make it less likely that an incoming message is held up behind another slow operation, but wouldn’t eliminate the problem in the way that feeding the current shortest queue would.
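To illustrate the effect I’m describing, here is a small toy simulation in Python. It is only a sketch under my own assumptions, not a model of EventStore internals: the function names, the 4 queues, the 1 ms arrival interval and the single 100 ms slow item are all made-up numbers, and each queue is assumed to be served by exactly one worker in FIFO order.

```python
def simulate(arrivals, n_queues, pick_queue):
    """Return per-item latency (finish - arrival) for a dispatch policy.

    arrivals: list of (arrival_time, service_time), sorted by arrival time.
    pick_queue: function(index, pending_counts) -> queue index.
    """
    finish_times = [[] for _ in range(n_queues)]  # per-queue finish times
    latencies = []
    for index, (arrival, service) in enumerate(arrivals):
        # Items not yet finished at this arrival time, per queue.
        pending = [sum(1 for f in q if f > arrival) for q in finish_times]
        q = finish_times[pick_queue(index, pending)]
        # The single worker on a queue starts once the previous item is done.
        start = max(arrival, q[-1]) if q else arrival
        finish = start + service
        q.append(finish)
        latencies.append(finish - arrival)
    return latencies


def round_robin(index, pending):
    # Ignores queue depth entirely.
    return index % len(pending)


def shortest_queue(index, pending):
    # Picks the queue with the fewest unfinished items (lowest index on ties).
    return pending.index(min(pending))


def count_delayed(arrivals, latencies):
    # An item is "delayed" if it spent any time waiting behind another item.
    return sum(1 for (_, service), lat in zip(arrivals, latencies)
               if lat > service)


def example_workload():
    # 40 items, one per millisecond; the first takes 100 ms, the rest 1 ms.
    return [(t, 100 if t == 0 else 1) for t in range(40)]


if __name__ == "__main__":
    work = example_workload()
    for policy in (round_robin, shortest_queue):
        delayed = count_delayed(work, simulate(work, 4, policy))
        print(policy.__name__, "delayed items:", delayed)
```

With these made-up numbers, round robin leaves 9 items stuck behind the slow one (every fourth message lands on the blocked queue), while the shortest-queue policy leaves none waiting. Real workloads will obviously be less clear-cut, but it shows why I’d expect depth-aware dispatch to help with tail latency.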

I appreciate that this is a long message with a lot of questions, but I would be grateful for replies to help me understand whether the best way forward for us would be to fork EventStore, to submit some PRs, or whether the improvements suggested above are already known dead ends for reasons I’m missing.