Time difference is too great on cluster

Hello,

We are trying to run Event Store as a cluster through Docker. Everything is fine (thanks for this amazing job!) but we get this kind of message after a few minutes:

```
[00001,14,09:54:23.463] Time difference between us and [10.0.0.2:2112] is too great! UTC now: 2017-06-14 09:54:23.463, peer's time stamp: 2017-06-14 09:52:38.198.
[00001,14,09:54:26.433] Time difference between us and [10.0.0.2:2112] is too great! UTC now: 2017-06-14 09:54:26.433, peer's time stamp: 2017-06-14 09:52:38.198.
[00001,14,09:54:29.437] Time difference between us and [10.0.0.2:2112] is too great! UTC now: 2017-06-14 09:54:29.437, peer's time stamp: 2017-06-14 09:52:38.198.
[00001,14,09:54:30.437] Time difference between us and [10.0.0.2:2112] is too great! UTC now: 2017-06-14 09:54:30.437, peer's time stamp: 2017-06-14 09:52:38.198.
[00001,14,09:54:36.442] Time difference between us and [10.0.0.2:2112] is too great! UTC now: 2017-06-14 09:54:36.442, peer's time stamp: 2017-06-14 09:52:38.198.
[00001,14,09:54:38.445] Time difference between us and [10.0.0.2:2112] is too great! UTC now: 2017-06-14 09:54:38.445, peer's time stamp: 2017-06-14 09:52:38.198.
```

In fact, the peer’s timestamp looks “stuck” at the start time (it never changes). We’re not sure how Event Store gets this timestamp, how we could prevent this log message, or how much of a problem it is if we cannot fix it.

We use Event Store 4.0.

What are the times on the nodes, and are you using NTP etc. to synchronize them?

Hello,

The time seems OK on each node:

```
eventstore@5068631d22d5:/app$ date
Wed Jun 14 12:07:37 UTC 2017

eventstore@e76bcdb2fe51:/app$ date
Wed Jun 14 12:08:50 UTC 2017
```

And no, currently we do not use NTP. Should we?

I can't tell how long you took between running those commands LOL :slight_smile:

There is a default limit of 60 seconds on the allowed time difference;
that is what the message is complaining about.
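
For reference, the tolerance is configurable. A minimal sketch, assuming the 4.x setting is named GossipAllowedDifferenceMs (in milliseconds) and follows the usual EVENTSTORE_ environment-variable convention; check the docs for your version:

```
# Raise the allowed gossip time difference from the default 60000 ms to 2 minutes
docker run -d \
  -e EVENTSTORE_GOSSIP_ALLOWED_DIFFERENCE_MS=120000 \
  eventstore/eventstore:release-4.0.3
```

That said, fixing the clock synchronization is usually a better option than widening the window.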

Yep, I realised after a while that both datetimes could be useful :wink:

Here is a simultaneous call:

```
docker exec -it 5068631d22d52c279b1f9f34d0d47e2ead1a8953f3dffd2b87345b4e3bc8d81d date; docker exec -it e76bcdb2fe51f3080a9e5187b0d82116b28003fd348d742c47639f501540d27f date
Wed Jun 14 12:29:01 UTC 2017
Wed Jun 14 12:29:01 UTC 2017
```

OK for 60s, good to know. But why is the peer’s timestamp still the same even after more than 30 minutes? Here is a screenshot with the full log:

http://screencloud.net/v/wttL7

And finally, after a few hours, the message seems to be gone. But when I run a new node, the problem comes back.

:slight_smile:

So this is in gossip. The system time is taken from the machine when
sending a gossip message. If the times are too far apart (by default 60
seconds), the node will ignore the gossip from that peer when updating
cluster status and print this message.

There are two common reasons for this. The most obvious is the clocks
actually drifting (things like NTP help here); the second is that the
message actually took, say, 60 seconds to be delivered, for varying reasons.
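
If you want to check the first one: Linux containers share the host's clock, so it's enough to verify synchronization on each host. Something like this, assuming a systemd-based host (tooling varies):

```
# Shows whether the system clock is NTP-synchronized (wording varies by systemd version)
timedatectl status

# Or query an NTP server directly, if ntpdate is installed
ntpdate -q pool.ntp.org
```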

Greg

OK, that makes sense.

But look here:
http://screencloud.net/v/DlvzD

The peer’s timestamp indicates “2017-06-14 12:33:01.559” whereas the machine clock is fine. I still don’t get why we keep receiving “2017-06-14 12:33:01.559”.

But if the problem only occurs at startup, that’s fine for now. Could this error cause us to lose data? I guess the slaves are not used until the time becomes synchronized?

Thanks for your time

No, there is no data loss possible here; this only affects the gossip
protocol for membership.

Hello, I have exactly the same issue with a database cluster on Docker. I created a network with:

```
docker network create -d overlay --attachable esnet
```

and then started a cluster with:

```
docker service create --detach=false --replicas 3 --name es1-node --network name=esnet,alias=escluster.net -e EVENTSTORE_CLUSTER_SIZE=3 -e EVENTSTORE_CLUSTER_DNS=escluster.net eventstore/eventstore:release-4.0.3
```

and after one minute I get warnings like this in the logs:

```
[00001,13,08:32:23.355] Time difference between us and [10.0.1.2:2112] is too great! UTC now: 2017-10-04 08:32:23.355, peer's time stamp: 2017-10-04 08:22:14.707.
```

The strange thing is that this node is trying to compare its time with the manager node at IP 10.0.1.2, but there is no container started with this IP.

We can see in the logs that a “Manager” node is listed at 10.0.1.2, even though we never started one:

```
[00001,13,08:22:16.153] CLUSTER HAS CHANGED (gossip received from [10.0.1.5:2112])
[00001,13,08:22:16.153] Old:
[00001,13,08:22:16.153] VND {eb19f5a3-c485-4978-947d-e8b2e87f9fb9} [Master, 10.0.1.5:1112, n/a, 0.0.0.0:1113, n/a, 10.0.1.5:2112, 0.0.0.0:2113] 14464/28294/28294/E0@0:{3c19d316-ff57-46b4-9020-b3b673bb7f8d} | 2017-10-04 08:22:16.002
[00001,13,08:22:16.153] VND {32d6344b-7ed1-46fc-991a-c02e34747348} [Slave, 10.0.1.4:1112, n/a, 0.0.0.0:1113, n/a, 10.0.1.4:2112, 0.0.0.0:2113] -1/28294/167/E-1@-1:{00000000-0000-0000-0000-000000000000} | 2017-10-04 08:22:16.033
[00001,13,08:22:16.153] VND {53829a5a-de74-480c-b1c0-9cc55f8ffd95} [Slave, 10.0.1.3:1112, 10.0.1.3:0, 0.0.0.0:1113, 0.0.0.0:0, 10.0.1.3:2112, 0.0.0.0:2113] 167/28294/305/E0@0:{3c19d316-ff57-46b4-9020-b3b673bb7f8d} | 2017-10-04 08:22:16.114
[00001,13,08:22:16.153] MAN {00000000-0000-0000-0000-000000000000} [Manager, 10.0.1.2:2112, 10.0.1.2:2112] | 2017-10-04 08:22:14.707
[00001,13,08:22:16.153] New:
[00001,13,08:22:16.154] VND {eb19f5a3-c485-4978-947d-e8b2e87f9fb9} [Master, 10.0.1.5:1112, n/a, 0.0.0.0:1113, n/a, 10.0.1.5:2112, 0.0.0.0:2113] 14464/28294/28294/E0@0:{3c19d316-ff57-46b4-9020-b3b673bb7f8d} | 2017-10-04 08:22:16.126
[00001,13,08:22:16.154] VND {32d6344b-7ed1-46fc-991a-c02e34747348} [Slave, 10.0.1.4:1112, n/a, 0.0.0.0:1113, n/a, 10.0.1.4:2112, 0.0.0.0:2113] -1/28294/167/E0@0:{3c19d316-ff57-46b4-9020-b3b673bb7f8d} | 2017-10-04 08:22:16.061
[00001,13,08:22:16.154] VND {53829a5a-de74-480c-b1c0-9cc55f8ffd95} [Slave, 10.0.1.3:1112, 10.0.1.3:0, 0.0.0.0:1113, 0.0.0.0:0, 10.0.1.3:2112, 0.0.0.0:2113] 14464/28294/28294/E0@0:{3c19d316-ff57-46b4-9020-b3b673bb7f8d} | 2017-10-04 08:22:16.153
[00001,13,08:22:16.154] MAN {00000000-0000-0000-0000-000000000000} [Manager, 10.0.1.2:2112, 10.0.1.2:2112] | 2017-10-04 08:22:14.707
```

Does this mean that we just have to ignore this “false positive” warning? Can we turn this warning off when running a database-only cluster?

Thanks

Hi Thierry,

What does the DNS record you are using for gossip discovery look like? Is there perhaps an extra entry for “10.0.1.2:2112”?

The default state of a node is shown as “Manager” if a node has been unable to gossip with it at all, so you might be getting this message because the cluster expects a node to be there even if it’s not.
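
One quick way to see what the name actually resolves to is from inside one of the containers (hypothetical container ID; assuming getent is available in the image):

```
# With Docker's default VIP endpoint mode this returns a single virtual IP
# (here 10.0.1.2) instead of one entry per task
docker exec -it <container-id> getent hosts escluster.net
```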

Hi

OK, I found the problem: when using Docker’s replicas feature, load balancing is done by default through a single virtual IP, so there is only one entry in the DNS. The service needs to be created in DNS round-robin mode to support service scaling:

```
docker service create --detach=false --replicas 3 --name es1-node --network name=esnet,alias=escluster.net -e EVENTSTORE_CLUSTER_SIZE=3 -e EVENTSTORE_CLUSTER_DNS=escluster.net --endpoint-mode=dnsrr eventstore/eventstore:release-4.0.3
```
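
With --endpoint-mode=dnsrr, resolving escluster.net returns one entry per replica instead of a single virtual IP, so each node can discover and gossip with the others directly. As an optional sanity check, you can resolve the name again from inside a container, or ask any node for its view of the cluster over the default external HTTP port (hypothetical node IP):

```
# Should list all cluster members as alive, with no phantom "Manager" entry
curl -s http://<node-ip>:2113/gossip
```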

Thanks !