Details on EventStore clustering when running in the cloud

Hi,

I am in the process of setting up an EventStore cluster for production, hosted on Google Cloud under Kubernetes. For now, we could get away with a single instance and persistent disks (I know this may affect latency, but latency/bandwidth is secondary to reliability and ease of use at this stage).

The question is, how does EventStore deal with the following:

  • Cluster consists of 3 nodes, each with its own IP

  • Each node has its own persistent disk backing it, which will survive restarts

  • A node may be killed at any point, and recreated with a new IP (that will show up under the DNS name that the cluster is configured with)

Specifically, will the fact that a node may be recreated under a new IP cause problems, as it’s not “the same” node that comes back, or is it simply based on the position of the node that re-appears? Is there any risk in the cluster becoming unhealthy if all nodes suddenly die and get recreated?

Cheers,

Kristian

"Specifically, will the fact that a node may be recreated under a new
IP cause problems, as it's not "the same" node that comes back, or is
it simply based on the position of the node that re-appears? Is there
any risk in the cluster becoming unhealthy if all nodes suddenly die
and get recreated?"

So moving IPs is no problem; we do this all the time with availability groups.

For the second question: obviously, if you lose all nodes in a cluster simultaneously, the cluster will not be very healthy. It should, however, be fine when the nodes come back :)

Cheers,

Greg

Kristian,

Are you still on the GCE + Kubernetes approach for running EventStore? Do you have any advice/examples/tutorials you can share?

Thanks :)

Jeff

+1 Kristian, we're using Kubernetes on AWS and would be interested in hearing about ES on k8s.

Hi,

I don’t have anything written up, but happy to share my experiences.

We are now running EventStore under Kubernetes on Google Cloud; AWS should be very similar. We have a 3-node cluster for our production environment (running in an isolated Kubernetes cluster), and an identical EventStore cluster for our “non-production” Kubernetes cluster that hosts dev and test environments.

Each node has its own Kubernetes deployment with 1 replica, i.e. we have three deployment files, identical except for the names, to form the 3-node cluster (a sketch of one is below). This was done before Kubernetes 1.3; PetSet looks like a nicer way to do this now.
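For reference, roughly what one of those per-node deployments looks like. This is a minimal sketch, not our exact file: the names, labels, image reference and data path are illustrative placeholders.

```yaml
# Sketch of one of the three per-node deployments (eventstore-1 / -2 / -3).
# Names, labels and the image reference are illustrative placeholders.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: eventstore-1
spec:
  replicas: 1                       # exactly one pod per EventStore node
  template:
    metadata:
      labels:
        app: eventstore             # shared label so services can select all nodes
        node: eventstore-1          # per-node label, differs in each of the three files
    spec:
      containers:
      - name: eventstore
        image: eventstore/eventstore   # stand-in; ours is a tweaked build of the official image
        ports:
        - containerPort: 1113       # default external TCP (client) port
        - containerPort: 2113       # default external HTTP port
        volumeMounts:
        - name: data
          mountPath: /var/lib/eventstore   # data directory; adjust to your image
      volumes:
      - name: data
        gcePersistentDisk:          # the node's dedicated persistent disk
          pdName: eventstore-1-data
          fsType: ext4
```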

The image used is based on the https://github.com/EventStore/EventStore Docker image, with some tweaks, for example to enable system projections and start them automatically.
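Our image bakes the equivalent settings into its configuration, but if memory serves the projection tweaks boil down to two settings, which could also be passed as container environment variables along these lines (do verify the names against your EventStore version):

```yaml
# Illustrative container env for the projection tweaks; names assumed to be the
# standard EventStore settings, so double-check them against your version.
env:
- name: EVENTSTORE_RUN_PROJECTIONS              # run system (and user) projections
  value: "All"
- name: EVENTSTORE_START_STANDARD_PROJECTIONS   # start the system projections automatically
  value: "True"
```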

We use Google persistent disks, SSD-based, and take snapshots nightly. Persistent disks allow us to completely recreate the Kubernetes cluster, e.g. for Kubernetes upgrades, without having to worry about data loss. Killing nodes and Kubernetes pods “just works”, and the EventStore cluster recovers as expected.
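We pre-created our disks and reference them directly from the deployments, but for completeness, here is a sketch of the dynamically-provisioned alternative on GKE using an SSD StorageClass and a per-node claim (names and sizes are illustrative; the exact API fields depend on your Kubernetes version):

```yaml
# Illustrative: SSD-backed StorageClass plus one claim per EventStore node.
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: eventstore-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd                      # Google Cloud SSD persistent disk
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: eventstore-1-data           # one claim per EventStore node
  annotations:
    volume.beta.kubernetes.io/storage-class: eventstore-ssd   # storageClassName on newer Kubernetes
spec:
  accessModes: ["ReadWriteOnce"]    # a GCE PD can only be attached to one node at a time
  resources:
    requests:
      storage: 100Gi                # size is illustrative
```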

EventStore is configured to use DNS for initial peer discovery, and a Kubernetes service with clusterIP: None is created for this purpose (allowing EventStore to find the IPs of the individual pods, rather than going through a load-balancing service proxy).
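Concretely, the headless service and the DNS discovery settings look roughly like this. The service name and namespace are ours, and the EVENTSTORE_* names are assumed to be the standard clustering options, so verify them against your version:

```yaml
# Headless service: no cluster IP, so its DNS name resolves to the pod IPs of all
# EventStore nodes instead of a single load-balanced virtual IP.
apiVersion: v1
kind: Service
metadata:
  name: eventstore-internal
spec:
  clusterIP: None
  selector:
    app: eventstore                 # matches the pods from all three deployments
  ports:
  - name: http
    port: 2113                      # the HTTP port the nodes gossip over in our setup
---
# (Fragment) the EventStore side of it, as container env; setting names assumed.
env:
- name: EVENTSTORE_CLUSTER_SIZE
  value: "3"
- name: EVENTSTORE_CLUSTER_DNS             # the headless service's DNS name
  value: "eventstore-internal.default.svc.cluster.local"
- name: EVENTSTORE_CLUSTER_GOSSIP_PORT     # must match the HTTP port the nodes gossip on
  value: "2113"
```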

A separate Kubernetes service is set up to allow clients to connect to the cluster, this one with load balancing, routing clients to a random healthy node. We’ve also recently enabled cluster discovery in our clients (backend services), but may disable this for services that we know are read-only (with regard to EventStore).
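The client-facing service is just a normal (non-headless) service over the same pods; a sketch, with the name as an illustrative choice:

```yaml
# Client-facing service: connections are load-balanced across healthy EventStore pods.
apiVersion: v1
kind: Service
metadata:
  name: eventstore
spec:
  selector:
    app: eventstore
  ports:
  - name: tcp                       # native TCP client protocol (default 1113)
    port: 1113
  - name: http                      # HTTP API / admin UI (default 2113)
    port: 2113
  # type: LoadBalancer could be added here if clients live outside the cluster.
```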

Overall things work well! One lesson learned is that persistent disk is very slow compared to local disk. This doesn’t cause issues for us in steady state, as transaction rates are still low, but it does mean that recovery when a node restarts is very slow, as checksums are verified and indexes rebuilt.

So far this all runs in a single availability zone; on the roadmap is setting up a Kubernetes cluster spanning zones and spreading the EventStore nodes across them, though I’m not sure how badly this will affect latency.

Cheers,
Kristian

Thanks for this info.

You can probably disable the chunk verification via config (you likely don't need it, as the disks are most likely already fault tolerant to a high level; IIRC AWS is something like 15 nines).
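For reference, a sketch of how that could look, assuming the setting is still called SkipDbVerify in your version:

```yaml
# Sketch: disabling chunk verification on startup in the node's config file;
# equivalently --skip-db-verify on the command line or
# EVENTSTORE_SKIP_DB_VERIFY=True as an environment variable.
SkipDbVerify: true
```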

Also 3.9.0 has some improvements in it that should roughly double the
speed of loading indexes.

Cheers,

Greg

Thank you, we may try turning this off.

I understand chunk verification is used to ensure the integrity of the data read from disk, but what happens when/if this fails? Does the node effectively lose those events and rely on some other node in the cluster having the data for them?

Cheers,

Kristian

It's basically testing the files to see if corruption has happened at the storage level (checking hashes). On error the node will die saying storage corruption has occurred; you would then manually restore from backup / tell it to re-replicate / etc. (which is best depends on your scenario).