Dead Nodes. Gossip Send failing.

Stuart_King · September 16, 2015, 6:39am

I have a cluster of three nodes running on three Vagrant instances.

The nodes start and use DNS discovery to locate the IP addresses of the other nodes. All the nodes report as LIVE initially, then start to report as DEAD. Gossip send is failing but it’s unclear why.

I have attached the logs for the first node (https://gist.github.com/stuartrexking/e23d9bd96b1f457c4be9). The logs are the same for each node, as is the command used to start each.

Thanks in advance.

Stu

pieter.germishuys · September 16, 2015, 6:48am

By default the gossip port is set to 30777; you need to change that to the external HTTP port (default is 2113).

Stuart_King · September 16, 2015, 6:55am

Can you explain why I would need to do that? Everything else is default.

I made the change you suggested and I get the same result:

$ sudo eventstored --mem-db -log /var/log/eventstore.log --int-ip 172.20.20.10 --ext-ip 172.20.20.10 --cluster-size=3 --cluster-dns eventstore.service.consul --cluster-gossip-port=2113

logs: https://gist.github.com/stuartrexking/1ec7fa85bf5293a66564

Greg_Young1 · September 16, 2015, 6:58am

From node 172.20.20.10, can you curl http://172.20.20.11:2113/gossip ?
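
To repeat that check against each peer, a small shell sketch along these lines may help. The helper names gossip_url and check_gossip are made up for illustration, the default port 2113 is the one used in the curl above, and the 5-second timeout is an arbitrary choice.

```shell
# Build the gossip URL for a node (hypothetical helper, not part of Event Store).
# $1 = node IP, $2 = HTTP port (falls back to 2113, the port in the curl above).
gossip_url() {
  printf 'http://%s:%s/gossip\n' "$1" "${2:-2113}"
}

# Fetch a node's gossip document and report whether the request succeeded.
# -f makes curl fail on HTTP errors; --max-time keeps an unreachable node from hanging.
check_gossip() {
  url=$(gossip_url "$1" "$2")
  if curl -fsS --max-time 5 "$url"; then
    printf '\nOK   %s\n' "$url"
  else
    printf 'FAIL %s\n' "$url" >&2
    return 1
  fi
}

# Example, run from 172.20.20.10 against the other node named in this thread:
# check_gossip 172.20.20.11
```

A FAIL here points at networking between the nodes; an OK with members still marked as not alive suggests the nodes are gossiping on a different port than the one being queried.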

Basically, your nodes are not talking to each other. They are not "working, then not working"; they are just not talking to each other.

pieter.germishuys · September 16, 2015, 7:04am

My apologies, you need to set the Cluster Gossip Port to the internal HTTP port (default: 2112).
The nodes need to chat amongst themselves: they “gossip” over HTTP, which means each node needs an IP and a port to gossip on.

By default the Cluster Gossip Port is 30777, which is the manager port. You don’t have a manager in your setup, so you need to specify the port that the nodes will gossip over.
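
As a sketch, the start command posted earlier in the thread with only the gossip port changed to 2112 would look roughly like the following. NODE_IP, GOSSIP_PORT and START_CMD are illustrative shell variable names, not Event Store settings; the snippet only prints the command so it can be reviewed before running it on each node with that node's own IP.

```shell
# Illustrative variables (assumptions): the first node's IP from this thread,
# and the internal HTTP port suggested above as the gossip port.
NODE_IP=172.20.20.10
GOSSIP_PORT=2112

# Same flags as the command posted earlier; only --cluster-gossip-port differs.
START_CMD="eventstored --mem-db -log /var/log/eventstore.log --int-ip $NODE_IP --ext-ip $NODE_IP --cluster-size=3 --cluster-dns eventstore.service.consul --cluster-gossip-port=$GOSSIP_PORT"

# Print rather than execute, so the command can be checked first.
echo "sudo $START_CMD"
```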

Stuart_King · September 16, 2015, 7:04am

Yes, I can (https://gist.github.com/stuartrexking/8bf22720ac81e8a637c7), although the nodes are reported with isAlive set to false.

Stu

Stuart_King · September 16, 2015, 7:06am

Thanks Pieter. That makes sense and solves the issue.