We’re seeing a large number of errors on our cluster nodes like:
2016-03-18/192.168.147.211-2114-cluster-node.log:[PID:16535:026 2016.03.18 20:49:41.641 ERROR GossipServiceBase ] Time difference between us and [192.168.147.213:2113] is too great! UTC now: 2016-03-18 20:49:41.641, peer’s time stamp: 2016-03-18 20:48:26.765.
However, the nodes' clocks are within milliseconds of each other, and the error is only logged when the two time stamps differ by a minute or more.
The nodes are virtual machines (on VMware) running Ubuntu 14.04 LTS, and NTP synchronisation between them confirms that their clocks are in agreement.
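To put a number on it, the gap in the log line above works out to roughly 75 seconds. Below is a rough Python sketch for pulling that gap out of the log lines so it can be tracked over time. The regex and the `gossip_time_gap` helper are my own assumptions based on the single log line quoted above, not anything provided by the cluster software.

```python
import re
from datetime import datetime

# Assumed log format, inferred only from the error line quoted above.
# The "." in "peer.s" matches either a straight or a curly apostrophe.
PATTERN = re.compile(
    r"UTC now: (?P<now>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}), "
    r"peer.s time stamp: (?P<peer>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"
)
FMT = "%Y-%m-%d %H:%M:%S.%f"


def gossip_time_gap(line):
    """Hypothetical helper: seconds between 'UTC now' and the peer's
    time stamp in one log line, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    now = datetime.strptime(m.group("now"), FMT)
    peer = datetime.strptime(m.group("peer"), FMT)
    return (now - peer).total_seconds()


if __name__ == "__main__":
    sample = (
        "Time difference between us and [192.168.147.213:2113] is too great! "
        "UTC now: 2016-03-18 20:49:41.641, "
        "peer's time stamp: 2016-03-18 20:48:26.765."
    )
    print(gossip_time_gap(sample))
```

Running this over a day's log should show whether the gap is always just over a minute or varies widely, which might help narrow down where the delay is introduced.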
It appears that the timestamps are compared only after the gossip messages have been processed. What is the processing path between receiving a gossip message and the actual comparison?
The Linux VMs are not reporting any CPU issues, and the network is not showing any problems. Where should we be looking for the performance bottleneck?
Cheers,
Joel