Heartbeat issues

Hi,

The logs on my 3-instance cluster are filled with this:

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ]
Looks like node [172.31.33.229:2112] is DEAD (Gossip send failed).

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ]
CLUSTER HAS CHANGED (gossip send failed to [172.31.33.229:2112])

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ] Old:

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ] VND
{a8077714-3620-4707-ac7b-e3f5b8104e84} <LIVE> [Slave,
172.31.36.140:1112, n/a, 127.0.0.1:1113, n/a, 172.31.36.140:2112,
127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:47:02.000

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ] VND
{3df72699-fd76-4272-88cb-4c8954370259} <LIVE> [Slave,
172.31.33.229:1112, n/a, 127.0.0.1:1113, n/a, 172.31.33.229:2112,
127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:46:59.557

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ] VND
{29f2b124-60bf-4ba1-957d-be84d62fb129} <LIVE> [Master,
172.31.33.228:1112, 172.31.33.228:0, 127.0.0.1:1113, 127.0.0.1:0,
172.31.33.228:2112, 127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:46:44.830

[PID:23007:024 2016.01.04 04:46:45.031 TRACE GossipServiceBase ] New:

[PID:23007:024 2016.01.04 04:46:45.031 TRACE GossipServiceBase ] VND
{a8077714-3620-4707-ac7b-e3f5b8104e84} <LIVE> [Slave,
172.31.36.140:1112, n/a, 127.0.0.1:1113, n/a, 172.31.36.140:2112,
127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:47:02.000

[PID:23007:024 2016.01.04 04:46:45.031 TRACE GossipServiceBase ] VND
{3df72699-fd76-4272-88cb-4c8954370259} <DEAD> [Slave,
172.31.33.229:1112, n/a, 127.0.0.1:1113, n/a, 172.31.33.229:2112,
127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:46:45.028

[PID:23007:024 2016.01.04 04:46:45.031 TRACE GossipServiceBase ] VND
{29f2b124-60bf-4ba1-957d-be84d62fb129} <LIVE> [Master,
172.31.33.228:1112, 172.31.33.228:0, 127.0.0.1:1113, 127.0.0.1:0,
172.31.33.228:2112, 127.0.0.1:2113]
479567741/479583243/479583243/E110@479567514:{89e31eed-2e28-4f81-a164-b2510f3fec87}

2016-01-04 04:46:44.830

[PID:23007:024 2016.01.04 04:46:45.031 TRACE GossipServiceBase ]

[PID:23007:024 2016.01.04 04:46:45.028 TRACE GossipServiceBase ]
CLUSTER HAS CHANGED (gossip send failed to [172.31.33.229:2112])

This looks more like a node is unreachable from one node but
accessible from another. What is your setup/config?

The timeouts you want are most likely the gossip timeouts, btw. There
are also internal TCP timeouts, but they give a different message.

Greg
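
One way to test the "unreachable from one node but accessible from
another" theory is to run the same TCP connect check on each of the
three machines and compare the results. The following is only a
minimal sketch: the endpoint list is copied from the gossip addresses
in the log above, and the names can_connect and reachability_report
are made up for illustration. They are not part of Event Store.

```python
import socket

# Assumption: these are the gossip endpoints (port 2112) seen in the log
# excerpt above. Adjust the list to match the actual cluster.
GOSSIP_ENDPOINTS = [
    ("172.31.36.140", 2112),
    ("172.31.33.229", 2112),
    ("172.31.33.228", 2112),
]


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def reachability_report(endpoints, timeout=1.0):
    """Map "host:port" to True/False for every endpoint in the list."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in endpoints
    }


if __name__ == "__main__":
    for endpoint, ok in reachability_report(GOSSIP_ENDPOINTS).items():
        print(endpoint, "reachable" if ok else "UNREACHABLE")
```

If one machine reports an endpoint as unreachable while the other two
can connect to it, that points at a one-directional network problem
rather than at a timeout setting. A successful connect only shows that
the port accepts TCP connections, not that gossip requests complete in
time.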

3 m3.medium EC2 instances, all in the same security group, with
unlimited internal traffic. Here's my eventstore.conf file from one of
the servers:

No, time drift would give you its own specific message. We need all 3
of the configs to say anything meaningful.

This is a bit weird: 127.0.0.1 as External IP?

Yeah. I have 3 servers, each running EventStore plus my app. Each app
instance talks only to its local EventStore instance, for both reads
and writes. To make that work, the app is configured to connect to
"localhost", which means the external IP has to be 127.0.0.1. Is it
possible to set it to 0.0.0.0 as an alternative?

cheers, Rickard
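
For background on the 0.0.0.0 question: at the plain socket level, a
listener bound to the wildcard address 0.0.0.0 also accepts
connections made to 127.0.0.1. The sketch below shows only that
general OS behaviour. The helper name loopback_reaches is made up for
illustration, and the sketch says nothing about whether Event Store
itself accepts 0.0.0.0 as an external IP or what address it would then
advertise over gossip.

```python
import socket


def loopback_reaches(bind_addr):
    """Bind a listener to bind_addr on an ephemeral port and report whether
    a client connecting to 127.0.0.1 on that port gets through."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((bind_addr, 0))  # port 0 lets the OS pick a free port
        srv.listen(1)
        port = srv.getsockname()[1]
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=1.0):
                return True
        except OSError:
            return False


if __name__ == "__main__":
    for addr in ("127.0.0.1", "0.0.0.0"):
        print(addr, "->", loopback_reaches(addr))
```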