Long start-up times in production cluster

Hi all,

We are currently running an EventStore cluster to store the history of our most important business objects and to subscribe external-facing applications to events happening in the core business domains.

Our cluster runs on 3 c4.large instances in AWS with 677 GB of EBS storage each. It stores 465 GB of data and processes around 10 events per second during normal operations.

We notice that after a node has gone down, it takes EventStore 2h 40min to start up again. A few weeks ago, two of the three nodes went down at once, so we had to wait more than two hours before EventStore was available again.

Is there something we could do to make such long cluster outages less likely? Is it recommended to run with --skip-db-verify in production? Do people have experience configuring cloud environments to minimize the chance of cluster nodes going down at the same time?

Our current configuration parameters, for the record:

The db verify is going to do ~700 GB of file operations on startup in your case! It runs in the background, but especially in an IOPS-limited environment this is an expensive operation.
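
As a back-of-envelope check, using only the numbers already in this thread (not a benchmark):

    465-700 GB read in ~2h 40min (~9,600 s)
    ≈ 48-73 MB/s sustained read throughput

which lines up with the dedicated EBS bandwidth of a c4.large (500 Mbps, i.e. roughly 60 MB/s, if I remember the instance specs correctly), so the verify pass is very plausibly what dominates your start-up time.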

We've been here, and there are a few options:

The node cannot start until the index is loaded. Make sure the index has been migrated to the latest version; there was an improvement to loading time that works by writing more data to the index file. If I remember correctly, the logs on start-up will tell you which version each index table is. See https://eventstore.org/docs/server/64-bit-index/index.html on how to do a rebuild to “upgrade” if needed.
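
For what it's worth, a rough sketch of forcing a rebuild, assuming the procedure is the one in the linked doc (stop the node, move the index directory out of the way, restart and let the node rebuild the index from the chunk files). The paths and service name below are examples only; check the doc and your own db location before doing anything like this in production:

    # NOTE: paths and service name are examples; adjust to your setup
    sudo systemctl stop eventstore
    mv /var/lib/eventstore/index /var/lib/eventstore/index.old
    sudo systemctl start eventstore   # index is rebuilt from the chunk files
    # once the rebuild has completed and the node is healthy:
    rm -rf /var/lib/eventstore/index.old

Bear in mind the rebuild itself has to read through the chunks, so on a database this size it will take a while too.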

We added a new option, -InitializationThreads, via a PR that parallelises the chunk-open process. This will only help if you are not IO-bound, though.
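
For reference, setting it looks roughly like this; the flag spelling is taken from the PR and the YAML key is assumed to follow the usual PascalCase naming, so treat both as illustrative and check your version's release notes:

    -InitializationThreads 4          # command-line form, as named in the PR
    InitializationThreads: 4          # assumed equivalent key in the YAML config file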

We do run in prod with --skip-db-verify, but we run RAID 10 as an alternative protection against disk corruption. Not sure if this is recommended, though.
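
In case it is useful, on EBS that setup looks something like the following (device names and mount point are illustrative):

    # mirror+stripe four EBS volumes into a RAID 10 array for the data disk
    sudo mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
    sudo mkfs.ext4 /dev/md0 && sudo mount /dev/md0 /var/lib/eventstore
    # the node itself is then started with the flag from this thread:
    ... --skip-db-verify ...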

Just to add a bit here:

--skip-db-verify

So skip-db-verify should be safe to run in most places. Basically, db-verify walks the entire db doing checksum validations. Many file systems and disk systems already have such validations happening underneath anyway. Beyond that, the failures it catches should be rather unlikely, and backups plus multiple nodes are often protection enough. I have gone back and forth on making this the default.

The reason it is nice to have is that when it does catch an issue, it is likely one that will occur in other places, and it may be some time before someone notices it in some other way (e.g. things will often still work even with problems underneath, unless you get a read exactly where the problem is). It is also possible to not run it in production but instead run it on a backup when the backup is taken, which provides 90+% of the value of doing it on startup.
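
A rough sketch of that last option, assuming backups already get restored somewhere for testing (binary name, paths and flags below are illustrative rather than a prescription):

    # restore the latest backup to a scratch volume, then start a throwaway
    # single node against it WITHOUT --skip-db-verify, so the checksum pass
    # runs against the backup instead of against the production nodes
    eventstored --db /mnt/scratch/es-backup
    # if it comes up cleanly, the backup has passed the same verification
    # the production nodes would otherwise do on startup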

Does that make sense?