Running with two datacenters

Looking at possibly running a quorum with two datacenters, with two nodes in one and one node in the other. I’m aware we could lose data if the datacenter with two nodes went down, but what is the risk? What is the possible delay for all nodes to be consistent?

Is this a good idea? I.e., we can tolerate the possibility of some loss, but want to minimise the risk.

Careful with the word “lose.” In theory, you can’t lose data (forever) once it is acked by the master. Obviously, the data may not be visible to you if the server you are talking to is partitioned in the minority. But that data will eventually become visible.

The delay for consistency (convergence) is indeterminate. As soon as the “lost” node rejoins the majority it will stream down updates. How long that takes depends on all the common factors: data volume, network speed, disk speed. In my limited experience, ES is reasonably efficient here.

A 2/1 split across DCs strikes me as suboptimal, for both performance and availability reasons. ES is a single-master system: writes must be forwarded to one node. There are usually cost, performance, and complexity issues with having the master role move between DCs without warning. Your mileage may vary, of course.

Traditionally, I’d prefer to run a complete cluster within a DC, and send all writes there. I’d then set up read-only replicas in other DCs. Alas, this isn’t yet natively supported in ES, AFAIK. Eventually we’ll be able to constrain slaves to be non-electable Clones, but I don’t think that is an official feature yet.

You can certainly lose data if the datacenter went up in flames. If you want cross-datacenter redundancy, then you must have acknowledged writes in at least 2 datacenters, which ideally means running a quorum across 3 datacenters: with one node per datacenter, any majority of two necessarily spans two of them.

If you have read-only replicas, you can lose data in the replication process.

My specific question is given nodes A & B in datacenter X and node C in datacenter Y, if a write is acknowledged by A & B, how long until C sees it? If datacenter X is destroyed, what is the window of possible data loss? Is this mode of operation a good idea, given our requirements (we currently only have 2 datacenters and can possibly tolerate a small risk of data loss)?

My more general question is: what are the options and trade-offs of running Event Store in two datacenters?

Just to give a bit more context, we currently run our system in a D/R configuration across two datacenters, with a SAN in each datacenter and failover everything in the event of failure. Short term we don’t have the option of a third datacenter. Interested in running event store in a quorum so we don’t have data loss in the event of failure.

Would appreciate advice.

Two data centers will always have issues due to the nature of the problem. However, what I know many are doing is two DCs (primary and secondary), then a "weak node" someplace else that allows quorums to be made.

OK, but if we can tolerate the chance of a small amount of data loss, then is running 3 nodes across 2 datacenters better than replicating the events through another process (listening to $all)?

If the datacenter with 2 nodes goes down, we can restart the node in the other datacenter to run without a quorum until we restore the other datacenter. This seems like fewer moving parts.

So the issue you run into is known as split brain (which the quorum solves). What happens if the communication between the two data centers goes down?

So, in our scenario, you would advise running a quorum of 3 nodes across the two datacenters? (accepting that we could lose some data if the datacenter with 2 nodes went down)

Jonathan, then why not a quorum of 5? I would consider that a much better setup with regard to robustness, given two DCs.

On second thought, perhaps not. If the DC with the 2 nodes blows up, well then I guess you are out of luck :slight_smile:

Three nodes, that is :slight_smile: too tired today, I guess :slight_smile:

Hah, yes, I understood what you meant :slight_smile:

The main point here is that we are happy for the system to go down for a short period if a datacenter goes down and even lose some data - that happens anyway with our D/R setup. We just want to have as robust a setup as is possible given the constraints.

Any more comments on why 3 nodes running a quorum across 2 datacenters would not be a sensible plan?

I would do two data centers + a weak node (depending on message volumes etc.) in the cloud or similar. We have a card to actually make that third node able to be a much weaker node, by making it just a "witness" as opposed to a full node.

Unfortunately we have legal restrictions on running a node in the cloud.

I think it is a fine thing considering your constraints and when we consider a whole DC going south.

office :slight_smile:

The problem is simple. Let's say I set up 3 nodes in two data centers...

I lose connectivity. Only one side is ever capable of making a quorum (the other will always have a minority). What happens when the failure is on the side that can make a quorum? This also happens with 5 or even 9 nodes (only one side has a quorum!)
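
To make that arithmetic concrete, here is a quick sketch (plain Python, nothing Event Store specific) enumerating the possible splits of 3, 5, and 9 nodes across two DCs. For every split, at most one side can ever reach a majority, so losing the link (or the larger DC) always leaves at most one writable side.

```python
# Majority arithmetic for N nodes split across two data centers.
# Whatever the split, at most one side holds a majority.

def quorum_sides(n_dc1, n_dc2):
    majority = (n_dc1 + n_dc2) // 2 + 1
    return n_dc1 >= majority, n_dc2 >= majority

for total in (3, 5, 9):
    for in_dc1 in range(1, total):
        dc1, dc2 = quorum_sides(in_dc1, total - in_dc1)
        print(f"{total} nodes split {in_dc1}/{total - in_dc1}: "
              f"DC1 has quorum={dc1}, DC2 has quorum={dc2}")
```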

If just doing two data centers, I would run as two separate instances and push from one to the other, then do a manual switch in the case of a failure (with a manual resolve process).
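
For what it's worth, the "push from one to the other" process could look something like the sketch below. This is only an illustration: `source` and `target` stand in for whichever Event Store client API you use (they are hypothetical here); the important parts are skipping system streams and persisting a checkpoint so the pusher can resume after a restart.

```python
# Hypothetical sketch of a one-way pusher: subscribe to $all on the primary
# cluster and append everything (except system streams) to the standby cluster.
# `source` and `target` are stand-ins for a real Event Store client.

import json
import os

class FileCheckpoint:
    """Persist the last replicated $all position so a restart can resume."""
    def __init__(self, path):
        self.path = path

    def load(self):
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)          # e.g. {"commit": ..., "prepare": ...}
        return None                          # start from the beginning of $all

    def save(self, position):
        with open(self.path, "w") as f:
            json.dump(position, f)

def push(source, target, checkpoint):
    """Forward events from `source` to `target`, one at a time, in order."""
    for event in source.subscribe_to_all(from_position=checkpoint.load()):
        if event.stream.startswith("$"):     # skip system/projection streams
            continue
        target.append(stream=event.stream,
                      event_type=event.type,
                      data=event.data,
                      metadata=event.metadata)
        checkpoint.save(event.position)      # only after the target has acked
```

The manual switch mentioned above is then just repointing writers at the standby cluster, and the manual resolve process deals with anything the pusher had not yet forwarded.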

I like that reasoning. Especially the fact that only one side can ever have quorum.

By two separate instances do you mean two sets of 3 nodes? The advantage of running quorum is that we can tolerate individual node failures. We care more about this than the datacenter failure really, in terms of operational cost and downtime.

So, I believe the options are this:

  1. Run 3 nodes across 2 datacenters. If the datacenter with 2 nodes goes down, we can’t accept writes until we either manually restart the downed datacenter’s nodes or reconfigure the remaining node to run as a single-node instance. Data loss is possible, albeit minimal.

  2. Run 3 nodes in each datacenter and create a separate process that subscribes to $all and pushes events to the other datacenter.

  3. As 2, but only run a single node in each datacenter.

We would prefer (1) or (2), as then we can tolerate a single node failure without immediate manual intervention. (1) seems preferable to (2), as with (2) we would also require the separate sync process to be highly available.

Is this summary reasonable?

1. Run 3 nodes across 2 datacenters. If the datacenter with 2 nodes goes down, we can't accept writes until we either manually restart the downed datacenter's nodes or reconfigure the remaining node to run as a single-node instance. Data loss is possible, albeit minimal.

No. If you manually decide the minority partition should just "accept writes", you have to manually put the cluster back together later. I don't think you are understanding how this works.

A quorum-based system requires a majority. Let's say that you have three nodes, a1, b1, and c2 (the number indicates the DC).

If you were to get a failure in c2, a1 and b1 would maintain a cluster, no problem. Everything would be automatic when c2 came back.

If you were to get a failure in DC1 (e.g. a1 and b1 are gone), c2 would not be able to make a majority with anyone and would say "wait". It would not come back again until it could talk to a1 or b1. You could manually shut it down and say "you are now master of a cluster of size 1", but this will be a mess when a1 and b1 come back, and you would need to manually sort out what happened (remember that a1 and b1 likely thought they still had a cluster when you did this and could have acknowledged the same or a different set of writes!).
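
One practical note on spotting the "minority side is waiting" state: each node exposes its view of the cluster over HTTP at /gossip, so a small check like the sketch below can tell you whether the side you can still reach believes it has a majority before you decide on any manual intervention. The node addresses here are made up, and the exact JSON field names may differ between versions, so treat this as illustrative only.

```python
# Rough operator check: ask each reachable node for its gossip view and see
# whether it believes a majority of the 3-node cluster (a1, b1, c2) is alive.
# Addresses are hypothetical; field names in the gossip JSON ("members",
# "isAlive") may vary between Event Store versions.

import json
import urllib.request

NODES = ["a1.dc1.example:2113", "b1.dc1.example:2113", "c2.dc2.example:2113"]
CLUSTER_SIZE = 3
MAJORITY = CLUSTER_SIZE // 2 + 1

def alive_members(node):
    with urllib.request.urlopen(f"http://{node}/gossip", timeout=2) as resp:
        gossip = json.load(resp)
    return sum(1 for m in gossip.get("members", []) if m.get("isAlive"))

for node in NODES:
    try:
        alive = alive_members(node)
        status = "has quorum" if alive >= MAJORITY else "NO quorum"
        print(f"{node}: sees {alive} live member(s) -> {status}")
    except OSError as err:
        print(f"{node}: unreachable ({err})")
```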