Cluster quorum and read/write

We are trying to confirm our understanding of the (eventual) consistency of data in a cluster to help inform the code we write around it. We have looked at the documentation and blogs (including https://geteventstore.com/blog/20130301/ensuring-writes-multi-node-replication/index.html and http://docs.geteventstore.com/server/3.0.0/cluster-with-manager-nodes/) and done some searches on this group looking for the same topic but could not find a match. If this has already been discussed then apologies and a pointer to an answer would be appreciated.

We are using an Event Store cluster for event sourcing and individual or load-balanced services which maintain in-memory read models.

Our understanding of the world is as follows. Please could you identify any incorrect assumptions:

  • At any one time, any given client could be talking to any single node in the cluster. It will use that node for reading, writing and subscriptions.

  • In any cluster, there is one master which is the only one that actually processes client write requests. All other cluster members will forward client write requests to the master.

  • In a cluster of N nodes, the master will forward the write request to N/2 other nodes as a set of node-to-node write requests and will not consider the client write request to be complete unless all of these node-to-node write requests succeed.

  • Once the client write request has been successfully persisted on N/2 + 1 nodes, the master will not make any more specific efforts to persist this client write request data on the remaining nodes but will leave it up to the gossip protocol to propagate it.

Based on this:

  • Until the gossip protocol has propagated the client write request data, a given client could be talking to a node that does not have the latest data.

  • If there are two eventstore clients talking to different nodes and those clients are instances of a service that maintain the same in-memory read model, then their read models could be out of sync.

  • This means that the clients of those services could get different answers from the read model based on which instance of the service they talk to.

  • The clients of the read model services (and anything else further towards the user) just have to deal with the potential temporary inconsistency (which is just another form of eventual consistency).

Dealing with the eventual consistency of the read models is not an issue in itself (after all, one service could be processing events a lot slower than another).

However, if we needed to reconstitute an aggregate based on the current state of events to determine if a CQRS command is valid, then this could be something we have to consider. Hence we wanted to make sure our understanding of how this works is right so we are not working on an incorrect assumption.

Thanks

Andy

Inline responses.

We are trying to confirm our understanding of the (eventual) consistency of
data in a cluster to help inform the code we write around it. We have looked
at the documentation and blogs (including
https://geteventstore.com/blog/20130301/ensuring-writes-multi-node-replication/index.html
and http://docs.geteventstore.com/server/3.0.0/cluster-with-manager-nodes/)
and done some searches on this group looking for the same topic but could
not find a match. If this has already been discussed then apologies and a
pointer to an answer would be appreciated.

We are using an Event Store cluster for event sourcing and individual or
load-balanced services which maintain in-memory read models.

Our understanding of the world is as follows. Please could you identify any
incorrect assumptions:

* At any one time, any given client could be talking to any single node in
the cluster. It will use that node for reading, writing and subscriptions.

Yes.

* In any cluster, there is one master which is the only one that actually
processes client write requests. All other cluster members will forward
client write requests to the master.

Yes.

* In a cluster of N nodes, the master will forward the write request to N/2
other nodes as a set of node-to-node write requests and will not consider
the client write request to be complete unless all of these node-to-node
write requests succeed.

The master will forward the write to all nodes and will consider it to be
done when n/2+1 have acknowledged that it's been done.
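
For reference, a minimal sketch of that quorum arithmetic (nothing here is
specific to Event Store internals, it is just the floor(N/2) + 1 rule
described above):

```typescript
// Quorum size for a cluster of N nodes: a write is acknowledged back to the
// client once floor(N / 2) + 1 nodes (master included) have persisted it.
function quorumSize(clusterSize: number): number {
  return Math.floor(clusterSize / 2) + 1;
}

console.log(quorumSize(3)); // 2 -- a 3-node cluster tolerates 1 node down
console.log(quorumSize(5)); // 3 -- a 5-node cluster tolerates 2 nodes down
```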

* Once the client write request has been successfully persisted on N/2 + 1
nodes, the master will not make any more specific efforts to persist this
client write request data on the remaining nodes but will leave it up to the
gossip protocol to propagate it.

Not quite. The replication uses log shipping. The other nodes will
actively be trying to get it. The gossip protocol is mainly being used
for node failure detection.

Based on this:

* Until the gossip protocol has propagated the client write request data, a
given client could be talking to a node that does not have the latest data.

Not because of the gossip protocol, but yes, any node could be in a minority
partition and may be in an eventually consistent state.

* If there are two eventstore clients talking to different nodes and those
clients are instances of a service that maintain the same in-memory read
model, then their read models could be out of sync.

This is always the case simply due to network latency :) It takes 10ms
to get the packet to you and 500ms to get the packet to another machine.

* This means that the clients of those services could get different answers
from the read model based on which instance of the service they talk to.

* The clients of the read model services (and anything else further towards
the user) just have to deal with the potential temporary inconsistency (which
is just another form of eventual consistency).

You can make sure a single node in a minority partition does not cause an issue.

Connect a subscription to the same stream on n/2+1 nodes and dedupe
between them. This will ensure that you are always getting writes
quickly.
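
A rough sketch of that dedupe idea, assuming events carry a sequential
per-stream event number starting at 0 and that each node delivers its own
subscription in order (the RecordedEvent shape and the way the handler is
wired into a subscription are illustrative, not the actual client API):

```typescript
// Minimal event shape for the sketch; only the sequence number matters here.
interface RecordedEvent {
  eventNumber: number; // sequential per-stream number: 0, 1, 2, ...
  data: unknown;
}

// Returns a handler that can be given to subscriptions on n/2+1 nodes.
// Each event is applied to the read model exactly once and in order, no
// matter which node delivers it first or how many duplicate copies arrive.
function dedupingHandler(apply: (e: RecordedEvent) => void): (e: RecordedEvent) => void {
  let next = 0;                                   // next event number we expect
  const ahead = new Map<number, RecordedEvent>(); // events delivered early by a faster node
  return (event) => {
    if (event.eventNumber < next) return;         // duplicate we have already applied
    ahead.set(event.eventNumber, event);          // overwriting an identical copy is harmless
    while (ahead.has(next)) {                     // drain everything we can apply in order
      apply(ahead.get(next)!);
      ahead.delete(next);
      next++;
    }
  };
}

// Usage sketch (subscribeToStream stands in for the client's subscription call):
// const handler = dedupingHandler(readModel.apply);
// for (const node of quorumOfNodes) subscribeToStream(node, "my-stream", handler);
```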

Dealing with the eventual consistency of the read models is not an issue in
itself (after all, one service could be processing events a lot slower than
another).

However, if we needed to reconstitute an aggregate based on the current
state of events to determine if a CQRS command is valid, then this *could*
be something we have to consider. Hence we wanted to make sure our
understanding of how this works is right so we are not working on an
incorrect assumption.

If you are using expected version you will never have a problem. Say you
load up to v7 because your node is in a minority partition while the rest of
the cluster is at v10. The write will go to your node (let's imagine it just
rejoined, so the forward actually works) and be forwarded on to the master,
which will see that the stream is at version 10 while your expected version
is v7, so it will give you an optimistic concurrency exception.
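
A sketch of how that looks from the client side. The Store interface here is
a hypothetical stand-in for whatever client API you use (real method names
and the concurrency error type will differ); the point is only that the
version read at load time travels with the write, so a stale node can never
silently append on top of events it has not seen:

```typescript
interface EventData { type: string; data: unknown; }

// Hypothetical client surface: reads report the version you saw, and appends
// carry that version back as the expected version.
interface Store {
  readStream(stream: string): Promise<{ events: EventData[]; lastVersion: number }>;
  // Rejects with a wrong-expected-version error if the stream on the master
  // has already moved past expectedVersion.
  appendToStream(stream: string, expectedVersion: number, events: EventData[]): Promise<void>;
}

async function handleCommand(
  store: Store,
  stream: string,
  decide: (history: EventData[]) => EventData[],
): Promise<void> {
  // Load the aggregate. If this node is in a minority partition we may only
  // see up to v7 even though the master is already at v10.
  const { events, lastVersion } = await store.readStream(stream);
  const newEvents = decide(events);

  // The append is forwarded to the master. Because we claim expectedVersion = 7
  // and the master is at 10, it rejects the write with an optimistic concurrency
  // error instead of appending on top of events we never saw. Catch that error
  // to re-read and retry the command, or surface the conflict to the caller.
  await store.appendToStream(stream, lastVersion, newEvents);
}
```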

Thanks Greg. This helps a lot. It is the expected version that was the missing piece of the jigsaw in the scenario we were thinking of.