Disaster recovery plan compliance

Hello,

We have to put a disaster recovery plan in place for an infrastructure with 2 different datacenter locations. We need to be able to keep the application running even if one of the 2 datacenters goes down or disappears :).

So, we were wondering how to deploy EventStore in this context.

  1. A single database node is not enough, since if the datacenter running it crashes, we lose EventStore entirely.

  2. A 3-node EventStore cluster means 2 nodes in datacenter 1 and 1 node in datacenter 2; if we lose datacenter 1, we lose 2 nodes and EventStore can no longer accept writes (see the quorum sketch after this list).

  3. Would it be possible to run one EventStore node in each datacenter, sharing the same storage (on a huge SAN infrastructure, yes, we know that’s not optimal…), one being active and the other activated only if the first one fails?
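To make the arithmetic we are reasoning about explicit, here is a tiny sketch of the majority rule as I understand it (the 2 + 1 split is the one from option 2 above):

```python
# Majority rule as I understand it: a cluster of n nodes keeps accepting
# writes only while at least n // 2 + 1 of them are reachable.
def majority(cluster_size: int) -> int:
    return cluster_size // 2 + 1

cluster_size = 3
nodes_in_dc1, nodes_in_dc2 = 2, 1   # option 2: 2 nodes in DC1, 1 node in DC2

# If DC1 disappears, only the DC2 node survives.
survivors = nodes_in_dc2
print(majority(cluster_size))               # 2
print(survivors >= majority(cluster_size))  # False -> no more writes accepted
```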

Thanks for your feedback.

Regards

Clément

Hey, perhaps the title of my topic has frightened everyone ^^.

I thought about it a bit more and I was wondering if I could just rely on a 2-node cluster (one node in each DC) to get replication (and avoid SAN storage), and if one DC fails (it has happened, but hopefully it is very rare), we would just “manually” restart the remaining node in single-node mode. Does that sound right to you?
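To be concrete about the “manual” part, I imagine a small watchdog on the surviving host, something like the sketch below (the hostname, service name, binary and flags are only assumptions on my side and would need to be checked against the real EventStore setup):

```python
# Rough sketch of the manual fallback: if the node in the other DC stays
# unreachable for a while, stop the clustered instance on this host and
# restart it as a single-node instance so it no longer waits for a majority.
# Commands, flags and hostnames below are assumptions, not tested config.
import subprocess
import time

import requests

REMOTE_NODE = "http://es-dc1:2113"   # assumed address of the node in the other DC
CHECKS_BEFORE_FALLBACK = 5           # tolerate a few missed pings before acting

def remote_is_alive() -> bool:
    try:
        return requests.get(f"{REMOTE_NODE}/ping", timeout=2).ok
    except requests.RequestException:
        return False

failures = 0
while failures < CHECKS_BEFORE_FALLBACK:
    failures = 0 if remote_is_alive() else failures + 1
    time.sleep(5)

# Assumed commands: stop the clustered service, then start the node again
# with a cluster size of 1 (single-node mode).
subprocess.run(["systemctl", "stop", "eventstore"], check=True)
subprocess.run(["eventstored", "--db", "/var/lib/eventstore", "--cluster-size", "1"],
               check=True)
```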

Thanks

With the new replication model coming out you will be able to run multi-master in this circumstance. In the current model you need an odd number of nodes; otherwise one data center can always end up in a minority.

Thanks Greg for your feedback.

I understand I need 2N+1 nodes to tolerate the failure of N nodes.

In the last case I proposed, the idea was to use a 2-node cluster just to get replication between the DCs (instead of doing incremental backups, cf. my other topic). Then I could manually restart the surviving node in single-node mode in case of a DC problem (or add a custom health check that does the restart…). Is that realistic?

Cool for the next version :). Does it mean that we will be able to have an even number of nodes?

"Cool for next version :). Does it means that we will be able to have
even number of nodes?"

Yes, and write to both. It is a multi-master model. You/we could implement something similar very quickly for your specific situation (as opposed to the general case) if it is needed quickly; drop me a note off-list if so.

"In the last case I proposed, the idea was to use a 2-node cluster just to get replication between the DCs (instead of doing incremental backups, cf. my other topic). Then I could manually restart the surviving node in single-node mode in case of a DC problem (or add a custom health check that does the restart…). Is that realistic?"

Not with the current replication model, no. You rely on a majority of nodes being available in a cluster, so in a cluster of two both must be available, since 1 out of 2 is not a majority. If we enabled them both to write, you could end up in nasty split-brain situations. However, that might be acceptable, in which case run two clusters of one and replicate yourself.
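By "replicate yourself" I mean something along these lines: a small copier that reads events from one single-node instance and appends them to the other. A very rough sketch over the HTTP API (hostnames, the stream name and the naive polling are placeholders, and a real copier would also follow the feed's paging links and handle errors):

```python
# Naive one-way copier between two single-node EventStore instances over the
# HTTP API. Reads events forward from the source stream and appends them to
# the target, reusing the original event ids so retries stay idempotent.
import time

import requests

SOURCE = "http://es-dc1:2113"   # assumed node in datacenter 1
TARGET = "http://es-dc2:2113"   # assumed node in datacenter 2
STREAM = "orders"               # hypothetical stream to mirror

def read_forward(base_url, stream, start, count=20):
    url = f"{base_url}/streams/{stream}/{start}/forward/{count}?embed=body"
    resp = requests.get(url, headers={"Accept": "application/json"})
    resp.raise_for_status()
    # Feed pages list newest entries first, so reverse to apply oldest first.
    return list(reversed(resp.json().get("entries", [])))

def append(base_url, stream, entry):
    headers = {
        "Content-Type": "application/json",
        "ES-EventType": entry["eventType"],
        "ES-EventId": entry["eventId"],
    }
    resp = requests.post(f"{base_url}/streams/{stream}",
                         headers=headers, data=entry["data"])
    resp.raise_for_status()

next_event = 0
while True:
    for entry in read_forward(SOURCE, STREAM, next_event):
        append(TARGET, STREAM, entry)
        next_event = entry["eventNumber"] + 1
    time.sleep(1)   # naive polling instead of long-polling the head of the feed
```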

What do you mean by “If we enabled them to both write”? Do we have to send writes explicitly to the master? I mean, with the HTTP client, how would I know which node to contact (given that I use DNS, for example)?

I am not sure I understand this master/slave thing correctly; I thought that when I send a write (to any node), EventStore needs a majority to acknowledge the write (without any notion of master/slave), doesn’t it?
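Or is the idea that the client should ask any node for its view of the cluster and pick the master itself? Something like this sketch (I am guessing at the endpoint behaviour and the field names in the gossip output):

```python
# Guess at how a client could locate the current master: ask any reachable
# node for its view of the cluster via /gossip and pick the member whose
# state is "Master". Field names and the DNS name are my assumptions.
import requests

def find_master(any_node: str) -> str:
    gossip = requests.get(f"{any_node}/gossip",
                          headers={"Accept": "application/json"}).json()
    for member in gossip["members"]:
        if member.get("state") == "Master" and member.get("isAlive"):
            return f'http://{member["externalHttpIp"]}:{member["externalHttpPort"]}'
    raise RuntimeError("no live master in the gossip response")

print(find_master("http://eventstore.internal:2113"))  # assumed DNS entry
```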

I understand the problem of majority in a cluster with an even number of nodes. My idea was to have a manual fallback to a single-node cluster with the surviving node, in case of emergency.

When you say “replicate yourself”, do you mean with continuous/incremental backup/restore?

"Whan you say "replicate yourself", you mean with
continuous/incremental backup/restore ?"

He is referring to something similar to what I mentioned earlier

When is this replication model coming out?

Shortly after 4.0. My guess (though it's me sticking my thumb in my butt and telling you it will rain next Thursday) is within 60-90 days.