Ghost "manager" nodes showing up in cluster overview

Kristian_Freed · November 19, 2017, 5:48pm

I’ve recently made some changes to our EventStore cluster resulting in the IP address advertised for each node being different from the IP address that peer nodes would see on connect.

It appears that, as a result, the EventStore cluster registers each node twice: once under the IP it has advertised, and once under the IP address it connected from. The duplicate record then shows up as a “Manager” node in the cluster overview, even though I’m running the OSS version and there are no manager nodes.

Should I be worried about this at all or does it just look funny?
Is there a way to make it so that these entries do not appear?

Cheers,

Kristian

Greg_Young1 · November 19, 2017, 5:51pm

“Should I be worried about this at all or does it just look funny?”

Due to the nature of the protocol, they should just happily disappear shortly. This information is spread over an epidemic protocol and should eventually just time out.

“Is there a way to make it so that these entries do not appear?”

I would need to look into it more.

Kristian_Freed · November 19, 2017, 5:59pm

They do appear to go away after some time (though much more than a few minutes). Until they do, I’m getting a lot of this in the logs:

[00001,17,17:56:15.574] Time difference between us and [x.x.x.x:2112] is too great! UTC now: 2017-11-19 17:56:15.574, peer’s time stamp: 2017-11-19 17:33:07.104.

[00001,17,17:56:16.576] Time difference between us and [x.x.x.x:2112] is too great! UTC now: 2017-11-19 17:56:16.576, peer’s time stamp: 2017-11-19 17:33:06.324.

[00001,17,17:56:20.582] Time difference between us and [x.x.x.x:2112] is too great! UTC now: 2017-11-19 17:56:20.582, peer’s time stamp: 2017-11-19 17:33:06.324.

[00001,17,17:56:21.583] Time difference between us and [x.x.x.x:2112] is too great! UTC now: 2017-11-19 17:56:21.583, peer’s time stamp: 2017-11-19 17:33:07.104.
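
To see how stale the ghost entry is, the two timestamps in each warning can be subtracted. The following is a minimal sketch, assuming the log line format shown above; `gossip_age`, `PATTERN` and `SAMPLE` are hypothetical names used for illustration and are not part of EventStore.

```python
import re
from datetime import datetime, timedelta

# Timestamp shape used in the warning, e.g. "2017-11-19 17:56:15.574".
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}"

# Accept either a typographic or a straight apostrophe in "peer's".
PATTERN = re.compile(rf"UTC now: (?P<now>{TS}), peer[’']s time stamp: (?P<peer>{TS})")

FMT = "%Y-%m-%d %H:%M:%S.%f"


def gossip_age(line: str) -> timedelta:
    """Return local UTC time minus the peer's last time stamp for one warning line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a time-difference warning")
    now = datetime.strptime(match["now"], FMT)
    peer = datetime.strptime(match["peer"], FMT)
    return now - peer


# First warning from the log excerpt above.
SAMPLE = (
    "[00001,17,17:56:15.574] Time difference between us and [x.x.x.x:2112] "
    "is too great! UTC now: 2017-11-19 17:56:15.574, "
    "peer’s time stamp: 2017-11-19 17:33:07.104."
)

print(gossip_age(SAMPLE))
```

For the first line above this gives a gap of roughly 23 minutes. The gap keeps growing from one warning to the next because the peer's time stamp does not advance while the local clock does.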

Greg_Young1 · November 19, 2017, 6:10pm

That can only happen if they do communicate as far as I know. How are you setting your clocks?

Kristian_Freed · November 19, 2017, 7:06pm

The timestamp is from when the node last connected. The node is reachable both on the “direct” IP (i.e. the IP that other nodes will see as the connecting one), and on the advertised one.

What appears to happen is:

1. Node A discovers peers and connects to one of them.
2. Master sees a new connection and registers a new member. For whatever reason, this member is flagged as a Manager; let’s call it A1.
3. Node A tells the rest about itself using the advertised IP address and gets picked up as a slave node; we now have it registered as A2.

Node A is reachable under the IPs of both A1 and A2, but the cluster does not think they’re the same. Only one of them becomes a slave. A1 eventually gets kicked out of the cluster.
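
The sequence described above can be mimicked with a toy membership table. This is only a sketch of the observed behaviour, assuming members are keyed by the endpoint they were seen or advertised under; it is not EventStore's actual implementation, and `Member`, `simulate_join` and the addresses are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Member:
    endpoint: str  # "ip:port" under which the member was registered
    state: str     # state shown in the cluster overview


def simulate_join(actual: str, advertised: str) -> Dict[str, Member]:
    """Register one physical node the way the thread describes it."""
    members: Dict[str, Member] = {}
    # The master sees an incoming connection from the actual address (A1).
    members[actual] = Member(actual, "Manager")
    # The node then gossips about itself using its advertised address (A2).
    # If both addresses are equal, this simply overwrites the first entry.
    members[advertised] = Member(advertised, "Slave")
    return members


# Hypothetical addresses: the connecting IP differs from the advertised one.
table = simulate_join("10.0.0.5:2112", "203.0.113.5:2112")
for member in table.values():
    print(member.endpoint, member.state)
```

With identical actual and advertised addresses the table holds a single entry, which matches the behaviour before the advertised IPs were changed.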

Hayley-Jean_Campbell · November 20, 2017, 7:08am

Hi Kristian,

Could you provide us with a sample of the configuration you are using?

What IP address are you using for your gossip seeds? Are you using the node’s actual address or the advertised address?

Kristian_Freed · November 23, 2017, 1:13pm

Hi,

Seeding is done via DNS; the DNS record resolves to the “actual” IPs rather than the advertised ones.

The incoming connections between nodes appear to come from the “actual” IPs, and each node is reachable on both the actual and the advertised IPs.

For communication between nodes, we don’t particularly care which IP is being used, but I want the advertised one to appear so that native clients will connect directly to the advertised IP, rather than the actual.

It all works, and eventually the ghost entries disappear, but it looks odd in the UI for some time.

Cheers,

Kristian

Greg_Young1 · November 23, 2017, 1:24pm

“It all works, and eventually the ghost entries disappear, but it looks odd in the UI for some time.”

So ES uses an epidemic model (gossip) for cluster membership. We should look at exposing the timeout of a known node as an option. I am guessing it being there for 5-15 minutes would not be an issue?

Kristian_Freed · November 23, 2017, 5:48pm

TBH, having them appear in the list in the UI is not an issue per se, I just wanted to confirm that this won’t cause any unwanted behaviour.

Greg_Young1 · November 23, 2017, 5:56pm

Nope, they time out after a while. You will notice they show up as dead very quickly (depending on your gossip settings).
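
The two-stage behaviour described here (dead quickly, removed later) can be illustrated with a small sketch. It assumes a member is judged purely by how long ago it was last heard from; the function name and both thresholds are hypothetical and do not correspond to actual EventStore settings.

```python
def classify(last_seen: float, now: float, dead_after: float, remove_after: float) -> str:
    """Classify a gossip member by the time (in seconds) since it was last heard from."""
    if remove_after < dead_after:
        raise ValueError("remove_after must not be smaller than dead_after")
    age = now - last_seen
    if age >= remove_after:
        return "removed"  # entry has timed out and is dropped from the list
    if age >= dead_after:
        return "dead"     # still listed, but shown as not alive
    return "alive"


# Hypothetical thresholds: dead after 5 seconds, removed after 30 minutes.
for age in (1, 60, 3600):
    print(age, classify(last_seen=0, now=age, dead_after=5, remove_after=1800))
```

A ghost entry that is never refreshed moves from alive to dead to removed on its own, which is why it lingers in the overview for a while before disappearing.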