Cluster issue after network partition

I have three Windows VMs in Azure, each in a different region (East US, West US, Central US).

They are running Event Store 3.0.1.

The regions are connected by VNET-to-VNET VPN connections.

On the Central US node, the TestClient’s WRFL command has been running against the cluster in a loop, once every 60 seconds.

After an issue with the connection between East US and West US (the VPN gateway disconnected and wouldn’t reconnect), the cluster landed in this state:

According to East US (/gossip)

  • East US is Master

  • Central US is Slave

  • West US isn’t on the list

According to Central US (/gossip)

  • East US is Master

  • Central US is Slave

  • West US isn’t on the list

According to West US (/gossip)

  • West US is Slave

  • East US isn’t on the list

  • Central US isn’t on the list

West US’s logs show that it is in a loop trying to elect a leader…

I am able to verify that all the servers can communicate (at least well enough to get a /gossip.json response from each other).
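For reference, this is roughly how I am checking the gossip from each box (a quick sketch; the host names are placeholders, and 2113 is the default external HTTP port):

```python
# Fetch /gossip from each node and print how that node sees the cluster.
# Host names below are placeholders; 2113 is the default external HTTP port.
import json
import urllib.request

NODES = ["eastus.example.local", "westus.example.local", "centralus.example.local"]

for node in NODES:
    req = urllib.request.Request(f"http://{node}:2113/gossip",
                                 headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            gossip = json.load(resp)
    except OSError as err:
        print(f"{node}: no gossip response ({err})")
        continue
    print(f"{node} sees:")
    for member in gossip.get("members", []):
        # state / isAlive / externalHttpIp are member fields in the 3.x gossip JSON
        print(f"  {member.get('externalHttpIp')}:{member.get('externalHttpPort')}"
              f"  state={member.get('state')}  alive={member.get('isAlive')}")
```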

It looks like East US was restarted while the network was segmented (i.e., while the cluster was split).

I have collected the log/config files as well (three zip files, ~15 MB each).

I have also collected the log file from the looped WRFL run.

I have verified that the cluster DNS entry resolves to all three nodes from all three of the boxes.
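For the DNS check, this is roughly what I ran on each box (the cluster DNS name below is a placeholder):

```python
# Resolve the cluster DNS entry and list the addresses it returns.
# The name below is a placeholder for the real cluster DNS entry.
import socket

CLUSTER_DNS = "escluster.example.local"

addresses = sorted({info[4][0] for info in socket.getaddrinfo(CLUSTER_DNS, None)})
print(f"{CLUSTER_DNS} resolves to {len(addresses)} address(es):")
for ip in addresses:
    print(" ", ip)
```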

The cluster is still in this state. I can leave it as is for a while. It is a dev/test environment.

Any ideas on what is going on?

Thanks, Ryan

Is West US able to gossip with the other nodes? It sounds like it can't talk to them.

Yes, from the West US box I was able to download the /gossip JSON of the other two servers.

Are they sending the correct host headers with the requests when posting gossip? I’ve heard of that causing problems before. In general you don’t want to be running clusters across regions anyway, as replication is synchronous and latency will suffer considerably.
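One related thing you can check from the outside is what addresses each node advertises in its gossip, which is where a wrong host header / advertised IP tends to show up. A rough sketch (the host names, the 2113 port, and the member field names are assumptions based on the 3.x gossip JSON):

```python
# Print the internal/external HTTP endpoints each member advertises in gossip,
# so a stale or mismatched address stands out. Host names, the port, and the
# field names are assumptions, not taken from the original post.
import json
import urllib.request

NODES = ["eastus.example.local", "westus.example.local", "centralus.example.local"]

for node in NODES:
    req = urllib.request.Request(f"http://{node}:2113/gossip",
                                 headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            members = json.load(resp).get("members", [])
    except OSError as err:
        print(f"{node}: unreachable ({err})")
        continue
    print(f"gossip from {node}:")
    for m in members:
        print(f"  internal http {m.get('internalHttpIp')}:{m.get('internalHttpPort')}"
              f"  external http {m.get('externalHttpIp')}:{m.get('externalHttpPort')}"
              f"  state={m.get('state')}")
```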