Cluster time difference error

We’re seeing a large number of errors on our cluster nodes like:

2016-03-18/[PID:16535:026 2016.03.18 20:49:41.641 ERROR GossipServiceBase ] Time difference between us and [] is too great! UTC now: 2016-03-18 20:49:41.641, peer's time stamp: 2016-03-18 20:48:26.765.

However, the nodes themselves are within milliseconds of each other, and the error only occurs if the time stamps are out by a minute or more.

The nodes are virtual (on VMware), running Ubuntu 14.04 LTS, and the NTP synchronisation statistics confirm that the nodes' clocks are in agreement.

It appears that the comparison of the timestamps is done after processing of the gossip messages. What is the processing path that occurs between receiving the gossip messages and the actual comparison?

The Linux VMs are not reporting CPU issues, and the networking is not showing any problems. Where should we be looking for the performance bottleneck?



There shouldn't be any performance bottleneck causing that (it's all
in-memory processing for gossip messages).

It also certainly should not cause this big of a difference:

2016-03-18 20:49:41.641, peer's time stamp: 2016-03-18 20:48:26.765

The place where something could be happening is the send/receive
taking a while, but a minute doesn't seem very likely.

If you run two instances of curl http://yourhost:httpport/gossip at
about the same time, what do they come up with?
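A minimal sketch of comparing two such gossip responses, assuming Python. The "members"/"timeStamp" field names and the hand-made sample JSON are illustrative assumptions about the response shape, not a guaranteed description of the endpoint's schema:

```python
# Compare the member time stamps reported by two gossip responses
# fetched at (roughly) the same moment. Adjust the field names to
# whatever your /gossip endpoint actually returns.
import json
from datetime import datetime

def newest_timestamp(gossip_json):
    """Return the most recent member time stamp in a gossip response."""
    doc = json.loads(gossip_json)
    stamps = [
        # Truncate to millisecond precision before parsing.
        datetime.strptime(m["timeStamp"][:23], "%Y-%m-%dT%H:%M:%S.%f")
        for m in doc["members"]
    ]
    return max(stamps)

def spread_seconds(gossip_a, gossip_b):
    """Absolute difference, in seconds, between the two responses."""
    delta = newest_timestamp(gossip_a) - newest_timestamp(gossip_b)
    return abs(delta.total_seconds())

# Example with two hand-made responses 0.2 s apart:
a = '{"members": [{"timeStamp": "2016-03-18T20:49:41.641"}]}'
b = '{"members": [{"timeStamp": "2016-03-18T20:49:41.841"}]}'
print(spread_seconds(a, b))  # 0.2
```

A spread in the order of milliseconds here, as Joel reports below, would rule out the responses themselves disagreeing.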

Also is it always the same nodes getting error messages?

Also can you send a log from around this time? Are these nodes under load?

If I curl the gossip port of node2 and node3 (from node1), they are identical bar millisecond differences in the timestamps from the delay in the two curl statements.

We have seen the error message on all three of the nodes at different times.

Although they are in use, there isn’t any suggestion from the system stats that they are under stress. However, they are hosted on a VMware cluster, so there might be resource issues at that level.

I’ve attached logs for the day before and the day in question - this particular instance happens just after midnight, but some possible errors occurred around 23:45 on one of the nodes. The node that was having problems eventually got into a loop and memory usage climbed until the node died (although the process is still there, it no longer responds to gossip). Restarting the process brings everything back into line.

This is our dev cluster, but we have seen the same thing happen in production.



combined_logs.tgz (4.99 MB)

Joel, how many esx(i) hosts do you have, and are the development ES nodes pinned to a particular esx(i) host?

Furthermore, is any particular activity happening “underneath” at that point, e.g. backup of vmware machines?

Weird, I don't have Joel's email.

Don’t have visibility of this at the moment, but have asked the question of the infrastructure team who run the ESX platform.

To be clear:

"We have seen the error message on all three of the nodes at different times."

One node having a clock issue will put messages on all the nodes (it
is gossiping with them). The real question is whether it's one node
with a weird clock or multiple. You can find this by looking in the
logs: they say what their own time is, which node they were gossiping
with, and what time they were given.



Again, I don’t have visibility, but we have seen the problem occurring at various times of the day, so it's not going to be something as simple as backups - sometimes it happens, sometimes it doesn’t.

5 hosts in the ESX cluster (for each environment), no pinning.

I'm not that familiar with ESX, but is it possible a node is getting
moved etc. and the clock is actually off for a period of time? If this
happens with some frequency, it should be possible to write a little
script that gets the clock and checks. There isn't any "fancy" code
around gossip; it just takes the clock and sends it. An alternative,
which I may add anyway, is to have gossip track what its expected time
is and start logging if that jumps.
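Such a script could, for instance, compare the wall clock against a monotonic clock and log whenever the wall clock moves further than the monotonic one between samples. A Python sketch, with the threshold and sampling interval as arbitrary illustrative choices:

```python
# Watch for wall-clock jumps by comparing time.time() (wall clock,
# steppable by NTP/hypervisor) against time.monotonic() (never steps).
import time

JUMP_THRESHOLD = 1.0  # seconds of unexplained wall-clock movement

def wall_jump(prev_wall, prev_mono, now_wall, now_mono):
    """How far the wall clock moved beyond what the monotonic clock says."""
    return (now_wall - prev_wall) - (now_mono - prev_mono)

def watch(interval=10.0):
    prev_wall, prev_mono = time.time(), time.monotonic()
    while True:
        time.sleep(interval)
        now_wall, now_mono = time.time(), time.monotonic()
        jump = wall_jump(prev_wall, prev_mono, now_wall, now_mono)
        if abs(jump) > JUMP_THRESHOLD:
            print(f"{time.ctime(now_wall)}: wall clock jumped by {jump:+.3f}s")
        prev_wall, prev_mono = now_wall, now_mono
```

Call watch() and leave it running on each node; a one-minute jump at the moment the gossip errors appear would point squarely at the VM clock being stepped.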

When I have seen the error message on all three nodes, it is for separate errors, and different hosts reporting as being out.

The nodes are VMs, so they don't have a real clock (and in fact they are on the same ESX cluster) - the NTP stats show that the clocks are in sync to within milliseconds, certainly nowhere near a minute apart. We have also seen it in each environment, so the same symptoms across 3 different clusters on 3 different ESX clusters. I don’t think the issue is clock related, but something around performance and the way that the processing path happens to do the comparison against the gossip timestamp.

We aren’t seeing performance issues within the Linux metrics, and the idle time percentage on main queue and storage writer queue is nearly always 100%.

The NTP stats would pick this up - we graph every minute, and the offset is in the order of milliseconds, not seconds, and certainly not minutes.

It isn’t a true time difference, but seems to be a delay in the way the gossips are processed.
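For reference, a minimal Python sketch of pulling that per-minute offset figure out of ntpq -pn output. The sample text and column positions assume the usual ntpq peer-table layout, with the selected peer marked by '*'; if your graphing uses a different source, the parsing will differ:

```python
# Extract the offset (in milliseconds) of the currently selected NTP
# peer from `ntpq -pn` output.
def selected_peer_offset_ms(ntpq_output):
    for line in ntpq_output.splitlines():
        if line.startswith("*"):  # '*' tally code = selected peer
            fields = line.split()
            # columns: remote refid st t when poll reach delay offset jitter
            return float(fields[8])
    return None  # no peer currently selected

sample = """     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.0.0.1        192.168.1.1      2 u   37   64  377    0.412    0.215   0.087
+10.0.0.2        192.168.1.1      2 u   12   64  377    0.398   -1.104   0.121
"""
print(selected_peer_offset_ms(sample))  # 0.215
```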

"I don't think the issue is clock related, but something around
performance and the way that the processing path happens to do the
comparison against the gossip timestamp."

What I am suggesting is that, due to something that occurs, the clock
shifts and is likely corrected afterwards. An example of this could be
moving a node.

The path is all in memory. I would be amazed if you were seeing 2
minutes of a performance issue in handling a gossip message.

Can you provide log messages from during such an incident from all 3
nodes? Also, your checks against NTP, as far as I have seen here, are
from when the messages are *not* happening; could you confirm this?

How long do the messages last?

If you look at the log archive I posted, it includes logs from 2 days for all three nodes, before the event and during.

The NTP stats are continuously collected and graphed (that was our first assumption, that it was a genuine clock issue).

I am not seeing this email?

Give me about an hour. I can't open it on my mac so will open it when
I get back to office.