How frequently should clusters change?

I am running a cluster of 3 Event Store nodes (3.0.1) and notice a lot of messages in the log about the cluster changing (approx. 150 over a 10-hour period) - see below for an example

Is this normal and something we can safely ignore, or should I be looking for an underlying cause?

What are your timeouts etc. set to, and what is your running environment? By default they are set reasonably low, and small networking latency spikes can cause nodes to be reported as dead, especially in cloud environments. Normally it just resolves itself, as you are seeing.

Thanks for the quick response!

I am currently using the default values.

The cluster is on the same subnet on virtual servers. On that basis I would expect the cluster to be reasonably stable. What's the downside of extending these timeouts?
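For reference, a minimal sketch of where these knobs live. In 3.x the gossip settings are command-line options; the flag names below follow the 3.x option naming but should be checked against your version, and the raised timeout value is purely illustrative (the defaults discussed in this thread are a 1000ms interval and a 500ms timeout):

```
# Illustrative only: loosening the gossip timeout on each node.
# Flag names per Event Store 3.x conventions; verify against your docs.
EventStore.ClusterNode.exe \
    --gossip-interval-ms 1000 \
    --gossip-timeout-ms 2500
```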

500ms is quite short though, especially on virtual servers where who
knows what may be interfering with things. Remember this is happening
once per second based on your config, so 100-150/day is actually not
very many as a percentage (also remember each server is doing this once
per second, so likely multiply by 3; you didn't state whether it was
100/day/server or 100/day total).
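To put that percentage in concrete terms, here is a back-of-envelope calculation. The 150 figure and the once-per-second gossip rate come from this thread; the rest is arithmetic:

```shell
# Each node gossips roughly once per second, so per node per day:
gossips_per_day=$((24 * 60 * 60))   # 86400 rounds
failures=150                        # upper end of the reported count
awk -v f="$failures" -v g="$gossips_per_day" \
    'BEGIN { printf "%.2f%% of gossip rounds involved a timeout\n", 100 * f / g }'
# prints: 0.17% of gossip rounds involved a timeout
```

So even at the high end, well under 1% of gossip rounds per node are timing out.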

The other thing to look at is the heartbeat timeouts on the internal
TCP connections, e.g. the replication protocol. These can also cause a
node to think another server is dead.
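Those internal heartbeats are tunable the same way. Again, the flag names follow the 3.x option naming and the values are illustrative, not recommendations:

```
# Illustrative: internal (replication) TCP heartbeat settings.
# Interval: how often a heartbeat is sent on the internal connection.
# Timeout: how long before the peer is presumed dead.
EventStore.ClusterNode.exe \
    --int-tcp-heartbeat-interval 700 \
    --int-tcp-heartbeat-timeout 1500
```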

Changing these parameters generally only affects how long it takes,
when a server actually is dead, to detect that and start changing the
cluster layout. In the case you gave it was a slave, so it didn't
matter. For most systems, whether this kind of automated failover
takes 600ms or 6000ms makes little difference (clients will retry
anyway, so it just appears as a latency spike). If you are very
concerned about failover times, the most important thing is a stable,
predictable runtime environment and connectivity. That rules out most
cloud systems and probably much virtualization (shared resources get
spiked by something else).
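The "clients will retry anyway" point can be sketched as a plain retry loop. Nothing here is Event Store client API; `flaky` is a hypothetical stand-in for a write that lands during an election (failing twice, then succeeding once a new master exists):

```shell
retry() {
  # Run "$@" up to 5 times with a short pause; succeed as soon as one
  # attempt works. Generic pattern, not an Event Store API.
  local attempt
  for attempt in 1 2 3 4 5; do
    if "$@"; then return 0; fi
    sleep 0.5
  done
  return 1
}

# Hypothetical operation: fails twice, then succeeds -- so the caller
# sees roughly one second of extra latency rather than an error.
tries=0
flaky() { tries=$((tries + 1)); [ "$tries" -ge 3 ]; }
retry flaky && echo "write succeeded after $tries attempts"
# prints: write succeeded after 3 attempts
```

From the application's point of view the failover is invisible except for that latency spike, which is the argument above for not over-tuning the timeouts.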