Hi all,
We are currently running an EventStore cluster to store the history of our most important business objects and to subscribe external-facing applications to events happening in the core business domains.
Our cluster runs on 3 c4.large instances in AWS with 677 GB of EBS storage each. It stores 465 GB of data and processes around 10 events per second during normal operations.
We have noticed that after a node goes down, it takes EventStore 2h 40min to start up again. A few weeks ago, two of the three nodes went down at once, so we had to wait more than two hours before EventStore was available again.
Is there anything we can do to make such long cluster outages less likely? Is it recommended to run with --skip-db-verify in production? Does anyone have experience configuring cloud environments to minimize the chance of cluster nodes going down at the same time?
Our current configuration parameters, for the record: