Heartbeat issues

Rickard_Oberg · January 4, 2016, 5:01am

Hi,

The logs on my 3-instance cluster are filled with this:

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ]
Looks like node [172.31.33.229:2112] is DEAD (Gossip send failed).

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ]
CLUSTER HAS CHANGED (gossip send failed to [172.31.33.229:2112])

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ] Old:

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ] VND
{a8077714-3620-4707-ac7b-e3f5b8104e84} <LIVE> [Slave,
172.31.36.140:1112, n/a, 127.0.0.1:1113, n/a, 172.31.36.140:2112,
127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:47:02.000

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ] VND
{3df72699-fd76-4272-88cb-4c8954370259} <LIVE> [Slave,
172.31.33.229:1112, n/a, 127.0.0.1:1113, n/a, 172.31.33.229:2112,
127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:46:59.557

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ] VND
{29f2b124-60bf-4ba1-957d-be84d62fb129} <LIVE> [Master,
172.31.33.228:1112, 172.31.33.228:0, 127.0.0.1:1113, 127.0.0.1:0,
172.31.33.228:2112, 127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:46:44.830

[PID:23007:024 2016.01.04 04:46:45.031 TRACE GossipServiceBase ] New:

[PID:23007:024 2016.01.04 04:46:45.031 TRACE GossipServiceBase ] VND
{a8077714-3620-4707-ac7b-e3f5b8104e84} <LIVE> [Slave,
172.31.36.140:1112, n/a, 127.0.0.1:1113, n/a, 172.31.36.140:2112,
127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:47:02.000

[PID:23007:024 2016.01.04 04:46:45.031 TRACE GossipServiceBase ] VND
{3df72699-fd76-4272-88cb-4c8954370259} <DEAD> [Slave,
172.31.33.229:1112, n/a, 127.0.0.1:1113, n/a, 172.31.33.229:2112,
127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:46:45.028

[PID:23007:024 2016.01.04 04:46:45.031 TRACE GossipServiceBase ] VND
{29f2b124-60bf-4ba1-957d-be84d62fb129} <LIVE> [Master,
172.31.33.228:1112, 172.31.33.228:0, 127.0.0.1:1113, 127.0.0.1:0,
172.31.33.228:2112, 127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:46:44.830

[PID:23007:024 2016.01.04 04:46:45.031 TRACE GossipServiceBase ]
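
When a dump like the one above repeats many times, it helps to see at a glance which member changed state between the Old and New views. Below is a minimal Python sketch that does this. The record layout (instance id, LIVE/DEAD flag, role, then a list of endpoints) is inferred only from the excerpt in this post, not from any Event Store specification, and the function names are made up for illustration.

```python
import re

# Matches one VND record after wrapped log lines have been joined.
# Assumption: the layout is "VND {id} <STATE> [Role, first-endpoint, ...",
# as seen in the excerpt; this is not taken from a documented format.
VND_RE = re.compile(
    r"VND\s*\{(?P<id>[0-9a-f-]+)\}\s*<(?P<state>\w+)>\s*"
    r"\[(?P<role>\w+),\s*(?P<endpoint>[\d.]+:\d+)"
)

def parse_members(text):
    """Return {instance_id: (state, role, first_listed_endpoint)}."""
    joined = " ".join(text.split())  # undo the line wrapping in the log
    return {
        m.group("id"): (m.group("state"), m.group("role"), m.group("endpoint"))
        for m in VND_RE.finditer(joined)
    }

def changed_members(old_text, new_text):
    """List (endpoint, old_state, new_state) for members whose state differs."""
    old, new = parse_members(old_text), parse_members(new_text)
    return [
        (new[i][2], old[i][0], new[i][0])
        for i in sorted(old.keys() & new.keys())
        if old[i][0] != new[i][0]
    ]
```

Fed the "Old:" and "New:" blocks from the dump above as two strings, `changed_members` would report the member at 172.31.33.229:1112 going from LIVE to DEAD.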

Greg_Young1 · January 4, 2016, 5:18am

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ]
CLUSTER HAS CHANGED (gossip send failed to [172.31.33.229:2112])

This looks more like a node is unreachable from one node but
accessible from another. What is your setup/config?

The timeouts you want are most likely the gossip timeouts, btw. There
are also internal TCP timeouts, but they give a different message.

Greg
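
One way to test the "unreachable from one node but accessible from another" theory is to run the same connectivity probe on each of the three servers and compare the results. The Python sketch below only checks that a plain TCP connection opens within a timeout; it does not speak the gossip protocol. The helper names are hypothetical, and the 2-second default timeout is an arbitrary choice.

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and unreachable hosts
        return False

def reachability(endpoints, timeout=2.0):
    """Map each (host, port) pair to whether a TCP connect succeeded."""
    return {ep: can_connect(ep[0], ep[1], timeout) for ep in endpoints}
```

For the cluster in this thread, one would call `reachability([("172.31.33.228", 2112), ("172.31.33.229", 2112), ("172.31.36.140", 2112)])` on every node, using the addresses that appear in the log excerpt. A peer that shows `False` from only one of the servers would point at a network path problem rather than a dead node.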

Rickard_Oberg · January 4, 2016, 5:29am

3 m3.medium EC2 instances, all in the same security group, with
unlimited internal traffic. Here's my eventstore.conf file from one of
the servers:

Greg_Young1 · January 4, 2016, 8:53am

No, you will get a specific issue on time drift. We need all 3 of the
configs to say anything meaningful.

This is a bit weird: 127.0.0.1 as External IP?

Rickard_Oberg · January 4, 2016, 9:16am

Yeah. I have 3 servers, each with EventStore plus my app. All app
instances talk to the local instance only, for read and write. To make
that work, the app is configured to connect to "localhost", which
means that the external IP is 127.0.0.1. Is it possible to set it to
0.0.0.0 as an alternative?

cheers, Rickard
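
As background to the 0.0.0.0 question: at the socket level, a listener bound to the wildcard address 0.0.0.0 also accepts connections made to 127.0.0.1, so a client that connects to "localhost" can still reach it. The Python sketch below shows only that general operating-system behaviour. Whether Event Store accepts 0.0.0.0 as its external IP setting is not established in this thread, and the function name is hypothetical.

```python
import socket

def accepts_on_loopback(bind_host):
    """Bind a listener to bind_host on a free port, then try to reach it via 127.0.0.1."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((bind_host, 0))  # port 0 lets the OS pick a free port
        srv.listen(1)
        port = srv.getsockname()[1]
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=1.0):
                return True
        except OSError:
            return False
```

On a typical host, both `accepts_on_loopback("127.0.0.1")` and `accepts_on_loopback("0.0.0.0")` succeed. The difference is that the wildcard listener is also reachable through the machine's other addresses, which matters for what the other cluster nodes can connect to.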