On several occasions, we have experienced downtime in our three-node production cluster (version 4.1.0 on Windows Server 2012 R2, hosted in AWS) when the cluster appears to get into an election deadlock. As best I can tell, the trouble starts when more than one node becomes Master during an election; after a brief period, the cluster seems to take itself offline, presumably because of the lack of consensus.
Initially, I assumed this was due to a network partitioning event of some sort, but that does not make a lot of sense: a single partitioned node should take itself out of the running for Master, and the other two should be able to reach consensus without it. Additionally, I have yet to find evidence of network partitioning actually occurring. The only errors we receive in the EventStore logs are DNS related (our clusters use DNS discovery to find one another). However, each time this has occurred, our other DNS resources have appeared unaffected, so it does not appear to be a widespread DNS issue.
The specific error we see when this occurs is:
[PID:01732:007 2018.07.24 19:07:36.635 ERROR GossipServiceBase ] Error while retrieving cluster members through DNS.
System.Net.Sockets.SocketException (0x80004005): No such host is known
at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
at System.Net.Dns.EndGetHostAddresses(IAsyncResult asyncResult)
at EventStore.Core.Services.Gossip.DnsGossipSeedSource.EndGetHostEndpoints(IAsyncResult asyncResult) in C:\projects\eventstore\src\EventStore.Core\Services\Gossip\DnsGossipSeedSource.cs:line 25
at EventStore.Core.Services.Gossip.GossipServiceBase.OnGotGossipSeedSources(IAsyncResult ar) in C:\projects\eventstore\src\EventStore.Core\Services\Gossip\GossipServiceBase.cs:line 94
Resolving this ends up requiring a full reboot of each node (sometimes several) before the error clears and the nodes begin communicating once again. On each occasion I’ve seen this, one of the nodes has continued to throw this error (and to attempt to step into the role of Master) until it was rebooted multiple times; only then does it come up cleanly and rejoin the cluster.
My first instinct is that some sort of DNS caching is being done incorrectly, or that EventStore is losing its ability to read the DNS record locally for a period of time.
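One way to test the DNS theory would be to run a small probe on the affected node while the error is occurring, to see whether the operating system's resolver can still resolve the cluster's discovery name. The following is only a rough sketch: the `probe` and `resolve_once` helpers and the hostname `cluster.example.internal` are hypothetical placeholders, and the script does not go through EventStore's own lookup code, so it can only help separate a host-level DNS problem from something specific to the EventStore process.

```python
import socket
import time
from datetime import datetime, timezone


def resolve_once(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname once.

    Returns (sorted list of unique IP strings, None) on success,
    or ([], error text) if the lookup fails.
    """
    try:
        infos = resolver(hostname, None)
    except socket.gaierror as exc:
        return [], str(exc)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos}), None


def probe(hostname, attempts=5, interval_seconds=1.0, resolver=socket.getaddrinfo):
    """Resolve hostname repeatedly, printing and returning each outcome."""
    results = []
    for attempt in range(attempts):
        addresses, error = resolve_once(hostname, resolver)
        stamp = datetime.now(timezone.utc).isoformat()
        outcome = f"FAILED: {error}" if error else ", ".join(addresses)
        print(f"{stamp} {hostname} -> {outcome}")
        results.append((stamp, addresses, error))
        if attempt < attempts - 1:
            time.sleep(interval_seconds)
    return results
```

Calling `probe("cluster.example.internal")` with the real discovery name substituted, on the node that keeps logging the error, should show whether lookups fail at the OS level at the same time. If they succeed while EventStore keeps reporting "No such host is known", that would point toward something inside the process rather than the DNS service itself.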
Our config file looks like: