Hi,
we’re running Eventstore 4.1.3 on Azure Kubernetes Service (AKS). As the amount of data grows, restarting Eventstore takes longer and longer: between 30 minutes and 2 hours per node for about 120GB of data. With an Eventstore cluster of 5 nodes, rolling out configuration changes or updating nodes (Eventstore or the underlying Kubernetes nodes) currently takes 6 hours or more.
While this seems largely related to the performance characteristics of the underlying “Premium Disks” (or, as Azure support sees it, to Eventstore's disk access patterns not being optimal for Azure), I wonder how to deal with this and what options I have to improve things in the given environment.
I took a look at the various configuration options Eventstore offers that might (or might not) affect startup speed specifically (runtime performance is a second topic to analyze later).
On a development server, the following options reduced startup time dramatically (back to just a few minutes from start to being ready):
EVENTSTORE_OPTIMIZE_INDEX_MERGE: "True"
EVENTSTORE_SKIP_INDEX_VERIFY: "True"
EVENTSTORE_SKIP_DB_VERIFY: "True"
I have yet to find which of the options actually gives the observed boost, or if it is the combination.
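To narrow this down, I could restart the development node once per combination of the three flags and time each run. A minimal sketch in Python that only enumerates the 8 combinations as environment settings; the helper name `flag_combinations` is my own, and applying the settings and measuring the restart is left out because it depends on the deployment:

```python
from itertools import product

# The three environment variables listed above.
FLAGS = [
    "EVENTSTORE_OPTIMIZE_INDEX_MERGE",
    "EVENTSTORE_SKIP_INDEX_VERIFY",
    "EVENTSTORE_SKIP_DB_VERIFY",
]


def flag_combinations(flags=FLAGS):
    """Return every on/off combination of the flags as env-var dicts.

    Hypothetical helper for planning test runs; it does not talk to
    Eventstore or Kubernetes in any way.
    """
    return [
        {name: str(value) for name, value in zip(flags, values)}
        for values in product([False, True], repeat=len(flags))
    ]


if __name__ == "__main__":
    # Print one line per test run so the settings can be copied
    # into the pod's environment by hand.
    for run, env in enumerate(flag_combinations(), start=1):
        print(run, env)
```

Comparing the startup times of the runs where exactly one flag is "True" against the all-"False" and all-"True" runs should show whether a single option or the combination is responsible.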
What do these options mean exactly, and **is it actually safe to enable them (on production)?**
I read the following in the docs, but could anyone give me more details? Are the operations behind these options expensive (IO- or CPU-wise)?
Does disabling them actually skip a step forever (or just shift them to later when the data is actually accessed)?
OptimizeIndexMerge: Bypasses the checking of file hashes of indexes during startup and after index merges. (Default: False)
SkipIndexVerify: Skips reading and verification of PTables during start-up. (Default: False)
SkipDbVerify: Bypasses the checking of file hashes of database during startup (allows for faster startup). (Default: False)
A second question: is there a way to increase the log level of Eventstore (esp. for disk IO related things), so I could get a better understanding of what takes so long?
Best regards,
Markus