Hi Dimitry,
- Which clusters have higher performance, the ones with fewer nodes due to the smaller internal communication overhead or the ones with more nodes?
Smaller clusters would have better write performance.
When a write is sent to the cluster, the master will write the events and replicate them to the other nodes in the cluster.
The master will then wait for a quorum of nodes (cluster_size/2 + 1, using integer division) to acknowledge the write before considering the write successful. Therefore, the more nodes in the cluster, the more acknowledgements a write has to collect before it completes.
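To make the quorum arithmetic concrete, here is a minimal Python sketch. The function name `quorum_size` is made up for this illustration and is not part of Event Store; it assumes the division in cluster_size/2 + 1 is integer division.

```python
def quorum_size(cluster_size: int) -> int:
    """Number of nodes that must acknowledge a write: cluster_size/2 + 1."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    # Integer division is an assumption of this sketch.
    return cluster_size // 2 + 1


# Print the quorum for a few common cluster sizes.
for n in (1, 3, 5, 7):
    print(f"{n} nodes -> quorum of {quorum_size(n)}")
```

Under that assumption, a 3-node cluster waits for 2 acknowledgements and a 5-node cluster waits for 3.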
- Can a cluster node be stopped (e.g. for maintenance) and restarted later?
Yes, a node can be stopped and restarted. It will rejoin the cluster and continue from where it left off.
This allows you to do operations such as rolling upgrades, where you take a node down, upgrade it, and bring it back up to join the cluster before upgrading the next node.
As long as the cluster has at least a quorum of nodes alive at a time, the cluster will still be able to process writes.
For example, in a cluster of 3 nodes the quorum would be 2, so the cluster will be able to tolerate one node going down without losing write functionality.
If fewer than a quorum of nodes are left, the cluster will still be able to service reads, but not writes.
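The rolling-upgrade reasoning above can be sketched as a small check. This is a simplified model and not Event Store code: the helper names `can_accept_writes` and `rolling_upgrade_keeps_writes` are hypothetical, and the sketch assumes exactly one node is down at each step of the upgrade.

```python
def can_accept_writes(cluster_size: int, alive_nodes: int) -> bool:
    """Writes need at least a quorum (cluster_size/2 + 1) of nodes alive."""
    quorum = cluster_size // 2 + 1
    return alive_nodes >= quorum


def rolling_upgrade_keeps_writes(cluster_size: int) -> bool:
    """Take one node down at a time; True if writes stay possible throughout."""
    for _node in range(cluster_size):
        alive = cluster_size - 1  # exactly one node is down during each step
        if not can_accept_writes(cluster_size, alive):
            return False
    return True


for size in (1, 2, 3, 5):
    print(size, rolling_upgrade_keeps_writes(size))
```

In this model a 3-node cluster keeps accepting writes during a rolling upgrade, while a 2-node cluster does not, because its quorum is also 2.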
- How long is the single node downtime allowed to be?
As mentioned in point 2, the cluster can continue to accept and process writes as long as at least a quorum of nodes is available.
- Does Event Store replicate all events stored during its downtime?
Replication happens live. Event Store follows a “shared nothing” philosophy, so each node has its own database.
As mentioned in point 1, the master will replicate the events to the slaves and expect a certain number of them to acknowledge the write before the write is considered successful.
- Can the IP address of a node in the cluster be changed when the node goes down and comes back up?
Yes. The nodes in a cluster use either a gossip seed or a DNS record to find the other nodes in the cluster.
When a node starts up, it will attempt to gossip with a node from the provided addresses.
If existing nodes receive gossip from an IP address they do not recognise, they will add it to their list of known nodes.
So as long as the node starting up with a new IP address knows where the other nodes are in order to start the gossip, it will be able to join the cluster.
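As a toy illustration of the discovery behaviour described above, the following Python sketch models nodes that add unrecognised senders to their list of known nodes. It is not Event Store's gossip protocol: the `Node` class, its methods, and the addresses are all invented for this example, and the exchange of full member lists on each gossip is an assumption. It also does not model removing the node's old address.

```python
class Node:
    def __init__(self, address, seeds=()):
        self.address = address
        # A node starts out knowing only itself and its configured seeds.
        self.known = {address, *seeds}

    def receive_gossip(self, sender_address, sender_known):
        # An unrecognised sender is simply added to the list of known nodes.
        self.known.add(sender_address)
        self.known |= set(sender_known)
        return set(self.known)

    def gossip_with(self, other):
        # Send our view to the other node and merge its reply into ours.
        reply = other.receive_gossip(self.address, self.known)
        self.known |= reply


a = Node("10.0.0.1", seeds=["10.0.0.2"])
b = Node("10.0.0.2", seeds=["10.0.0.1"])

# A node restarts under a new address but still has a seed pointing at node a.
c = Node("10.0.0.99", seeds=["10.0.0.1"])
c.gossip_with(a)  # a learns about the new address, c learns about b
a.gossip_with(b)  # b learns about the new address through a

print(sorted(b.known))
```

The point of the sketch is only that the restarted node needs at least one reachable seed; the rest of the cluster learns its new address through gossip.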
- What happens if we restore the database of one cluster node from a backup? Does Event Store replicate all events not available in the backup data?
You can start a node in a cluster from a backup. The node will join the cluster and subscribe to the master node to replicate any data it is missing.
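Conceptually, the catch-up after a restore looks something like the sketch below. This is a heavily simplified model, not how Event Store stores or replicates data: logs are plain Python lists, the function `catch_up` is hypothetical, and it assumes the restored backup is an exact prefix of the master's log.

```python
def catch_up(master_log: list, replica_log: list) -> int:
    """Copy the events the replica is missing; return how many were copied."""
    # Assumption: the replica's log is a prefix of the master's log,
    # so its length marks the position to resume replication from.
    start = len(replica_log)
    missing = master_log[start:]
    replica_log.extend(missing)
    return len(missing)


master = ["event-0", "event-1", "event-2", "event-3"]
restored = ["event-0", "event-1"]  # node restored from an older backup

copied = catch_up(master, restored)
print(copied, restored == master)
```

Once the missing events have been copied, the restored node holds the same data as the master in this model.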
- How can we back up a complete cluster?
You only need to back up one of the nodes in the cluster. Instructions for backing up a node can be found here.
- Can the number of nodes in the cluster be increased and the quorum size changed without stopping the cluster completely? This feature was discussed in some postings in 2014 but I have not seen it mentioned in the documentation yet.
The cluster size is fixed and cannot be changed on the fly.
- What advantages do manager nodes have over AWS-based autoscaling groups, which can automatically restart database nodes?
These are not necessarily exclusive.
Manager nodes are only available for running Event Store on Windows. These nodes ensure that the Event Store service is always running, and restart it if something happens, such as an offline truncation.
On Unix systems, this functionality is covered by systemd and the like.
AWS autoscaling groups work well for ensuring a specific number of instances are always running and for recovering from one of the instances going down.
- Which additional tools come with commercial support?