Cluster Time difference error

Sure - should just be a tgz file…

```
tar -xzf combined_file.tgz
```

should open it on a mac

Thanks for your assistance…

Oddly enough, when we get this issue, restarting the eventstore process brings the nodes back into sync with no complaints about time. It’s as if the eventstore builds up a backlog of gossip messages, but my understanding was that gossip messages were discarded if they took too long to process.

They are.

I am noticing there are a whole bunch of issues happening around midnight (I've only looked at .212 so far): heartbeat timeouts, election data not being able to be sent, etc.

There is a time difference in the middle of it.

Much of this log is showing pretty severe network issues (cause unknown at this point); as an example, you have run through 57,000 election cycles.

On the gossip, I think you are right that they are being queued. I just audited the code, and the timeout removes the task but does not cancel the actual operation, so it could just be retrying very slowly (and when they do come through, they come through every few ms).
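
As a rough illustration of that pattern (a hypothetical sketch, not the actual EventStore code): the "timeout" below removes the bookkeeping for a pending gossip request but never aborts the underlying HTTP operation, so its callback can still deliver a stale response long after the timeout has fired.

```csharp
// Hypothetical sketch of the failure mode described above -- not EventStore's
// implementation. The timeout forgets about the request but never calls
// request.Abort(), so the operation stays in flight and completes later.
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Net;
using System.Threading.Tasks;

class GossipClientSketch
{
    private readonly ConcurrentDictionary<Guid, HttpWebRequest> _pending =
        new ConcurrentDictionary<Guid, HttpWebRequest>();

    public void SendGossip(string url, Action<string> onResponse)
    {
        var id = Guid.NewGuid();
        var request = (HttpWebRequest)WebRequest.Create(url);
        _pending[id] = request;

        request.BeginGetResponse(ar =>
        {
            // This still runs whenever the response finally arrives, even if
            // the "timeout" below gave up on the request long ago.
            HttpWebRequest forgotten;
            _pending.TryRemove(id, out forgotten);
            try
            {
                using (var response = (HttpWebResponse)request.EndGetResponse(ar))
                using (var reader = new StreamReader(response.GetResponseStream()))
                    onResponse(reader.ReadToEnd());
            }
            catch (WebException) { /* failed or aborted */ }
        }, null);

        // The "timeout": drop our reference after 2 seconds. Nothing is
        // actually cancelled because request.Abort() is never called.
        Task.Delay(TimeSpan.FromSeconds(2)).ContinueWith(_ =>
        {
            HttpWebRequest forgotten;
            _pending.TryRemove(id, out forgotten);
        });
    }
}
```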

On 17/3 at 23:46 node 211 generates an error log with VERY SLOW BUS MSG alerts. This is just after
```
2016.03.17 23:44:46.952 DEBUG ClusterVNodeControll] There is NO MASTER or MASTER is DEAD according to GOSSIP. Starting new elections.
```

node 212 had been master.

This was when we got our first monitoring alerts:

```
checking gossip at 192.168.147.212:2113
Checking for 3 nodes
Checking nodes for IsAlive state
CheckGossip CRITICAL: Only 2 alive nodes, should be 3 alive

Host: ggh-d-evtapp02
Timestamp: 2016-03-17 23:46:29 +0000
Address: 192.168.147.212
```

```
checking gossip at 192.168.147.213:2113
Checking for 3 nodes
Checking nodes for IsAlive state
CheckGossip CRITICAL: Only 2 alive nodes, should be 3 alive

Host: ggh-d-evtapp03
Timestamp: 2016-03-17 23:46:29 +0000
Address: 192.168.147.213
```

```
checking gossip at 192.168.147.211:2113
Checking for 3 nodes
Checking nodes for IsAlive state
CheckGossip CRITICAL: Only 1 alive nodes, should be 3 alive

Host: ggh-d-evtapp01
Timestamp: 2016-03-17 23:46:29 +0000
Address: 192.168.147.211
```

So, nodes 213 and 212 see two nodes; node 211 sees one.

All recovers at the next check at 23:47:29

Next alerts:

```
nodes found with states:
PreReplica when expected Master or Slave.

checking gossip at 192.168.147.212:2113
Checking for 3 nodes
Checking nodes for IsAlive state
Checking for exactly 1 master
Checking node state

Host: ggh-d-evtapp02
Timestamp: 2016-03-18 00:01:29 +0000
Address: 192.168.147.212
Status: WARNING
```

```
nodes found with states:
PreReplica when expected Master or Slave.

checking gossip at 192.168.147.213:2113
Checking for 3 nodes
Checking nodes for IsAlive state
Checking for exactly 1 master
Checking node state

Host: ggh-d-evtapp03
Timestamp: 2016-03-18 00:01:29 +0000
Address: 192.168.147.213
Status: WARNING
```

```
checking gossip at 192.168.147.211:2113
CheckGossip CRITICAL: Could not connect to http://192.168.147.211:2113/gossip?format=xml to check gossip, has event store fallen over on this node?

Host: ggh-d-evtapp01.dev.green.net
Timestamp: 2016-03-18 00:01:29 +0000
Address: 192.168.147.211
Status: CRITICAL
```

Recovers at 00:10:29

Repeats at 00:11:29

Recovers at 00:38:20

Repeats at 01:17:29

Continues flapping thereafter until 20:01, at which point node 211 no longer recovers and again gets into a Time difference error state. This continues until 20:42, when it reports:

```
[PID:16535:017 2016.03.18 20:42:14.809 ERROR GossipController ] Received as POST invalid ClusterInfo from [http://192.168.147.211:2113/gossip]. Content-Type: application/json.
[PID:16535:017 2016.03.18 20:42:14.809 ERROR GossipController ] Received as POST invalid ClusterInfo from [http://192.168.147.211:2113/gossip]. Body: .
```

This continues until

```
[PID:16535:007 2016.03.18 20:47:22.051 ERROR JsonCodec ]
```

During this time, memory use was ballooning, and then

```
[PID:16535:018 2016.03.18 22:21:00.433 DEBUG MonitoringService ] Could not get free mem on linux, received memory info raw string: []
```

and it was completely fritzed.

So there are two issues here. The first is that the old gossips aren't really cancelling and come through later. This is relatively straightforward to deal with at this point: https://github.com/EventStore/EventStore/blob/release-v3.6.0/src/EventStore.Transport.Http/Client/HttpAsyncClient.cs#L19 (you might enjoy the comment). Most likely it was failing on the create, blocking. There is now a compatible license and we can use the other code listed. I will put up an issue.
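
A hedged sketch of the general shape of a fix (this uses HttpClient rather than the existing HttpAsyncClient, and assumes nothing about the eventual implementation): if the timeout actually cancels the in-flight request, a "timed out" gossip can never be delivered minutes later.

```csharp
// Sketch only: tie the timeout to a real cancellation of the request so a
// late response cannot come back after we have given up on it.
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class GossipWithRealTimeout
{
    static readonly HttpClient Http = new HttpClient();

    public static async Task<string> GetGossipAsync(string url, TimeSpan timeout)
    {
        // Cancelling the token makes HttpClient abandon the underlying
        // operation rather than merely forgetting about it.
        using (var cts = new CancellationTokenSource(timeout))
        {
            try
            {
                var response = await Http.GetAsync(url, cts.Token);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
            catch (OperationCanceledException)
            {
                return null; // treat as a dead node; no late delivery possible
            }
        }
    }
}
```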

The second issue is what caused everything to get so backed up network-wise. On .212 it looks like all networking is having trouble, in particular with .211: you can see it trying to come back, losing connections, etc. .211 doesn't see anyone else (and is having problems on all forms of networking, TCP and HTTP).

EG:

```
[PID:19643:021 2016.03.18 00:08:29.503 INFO TcpService ] Internal TCP connection accepted: [Normal, 192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}].
[PID:19643:024 2016.03.18 00:08:32.504 TRACE TcpConnectionManager] Closing connection 'internal-normal' [192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 0
[PID:19643:024 2016.03.18 00:08:32.504 INFO TcpConnection ] ES TcpConnection closed [00:08:32.505: N192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Received bytes: 0, Sent bytes: 22
[PID:19643:024 2016.03.18 00:08:32.505 INFO TcpConnection ] ES TcpConnection closed [00:08:32.505: N192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Send calls: 1, callbacks: 1
[PID:19643:024 2016.03.18 00:08:32.505 INFO TcpConnection ] ES TcpConnection closed [00:08:32.505: N192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Receive calls: 1, callbacks: 0
[PID:19643:024 2016.03.18 00:08:32.506 INFO TcpConnection ] ES TcpConnection closed [00:08:32.506: N192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 0
```

On .211:

```
[PID:16535:026 2016.03.18 00:02:37.202 DEBUG ElectionsService ] ELECTIONS: (V=55714) TIMED OUT! (S=ElectingLeader, M=).
[PID:16535:026 2016.03.18 00:02:37.202 DEBUG ElectionsService ] ELECTIONS: (V=55715) SHIFT TO LEADER ELECTION.
[PID:16535:026 2016.03.18 00:02:37.202 DEBUG ElectionsService ] ELECTIONS: (V=55715) VIEWCHANGE FROM [192.168.147.211:2113, {12c20f0a-fddd-41a9-abf2-25b17eac9ef0}].
[PID:16535:026 2016.03.18 00:02:37.607 INFO ReplicaService ] Subscribing at LogPosition: 4680637529 (0x116FCDC59) to MASTER [192.168.147.212:1111, {ca3cda3c-7619-4b57-9bbe-8151ea9e8082}] as replica with SubscriptionId: {c40c4ee6-bdf3-425d-928c-fc6294b3e139}, ConnectionId: {a92bcebd-2269-47ea-a8ae-bb89a558e3cc}, LocalEndPoint: [], Epochs: E2483@4680637299:{d40f40f4-7b50-4cb6-9c22-c87382698475}
```

During this time, memory use was ballooning, and then

```
[PID:16535:018 2016.03.18 22:21:00.433 DEBUG MonitoringService ] Could not get free mem on linux, received memory info raw string: []
```

This I think might be due to the queueing of those messages; I will audit that code more closely.

What's also interesting is that I have only seen anything like this a few times (only in Azure). I wonder if there is something particular about how the virtualized networks are getting partitioned in these environments as opposed to others.

I'm also going to add monitoring to the statistics, as well as a histogram there, to get some deeper information.

We are going to try putting all three nodes as a group so that they live on the same ESX host (and so eliminate the networking issue).

This has the downside of meaning that if that ESX host fails, we lose the whole cluster. However, in Dev at least we can do this to eliminate networking as part of the problem.

The networking issue doesn't seem to fit - we've not had any tx or rx errors:

```
uptime
15:15:19 up 150 days, 4:16, 2 users, load average: 0.81, 0.89, 0.68

/opt/sensu/embedded/bin/metrics-net.rb
ggh-d-evtapp01.net.eth0.tx_packets 358948797 1459520042
ggh-d-evtapp01.net.eth0.rx_packets 409045650 1459520042
ggh-d-evtapp01.net.eth0.tx_bytes 117129928809 1459520042
ggh-d-evtapp01.net.eth0.rx_bytes 125511090261 1459520042
ggh-d-evtapp01.net.eth0.tx_errors 0 1459520042
ggh-d-evtapp01.net.eth0.rx_errors 0 1459520042
ggh-d-evtapp01.net.eth0.if_speed 10000 1459520042
```

These are counters, so that's no tx_errors or rx_errors in 150 days. There is no firewall between the nodes.

Indeed, this seems to be the clincher (for 211):

```
[root@ggh-d-evtapp01:2016-03-31]# netstat -s
Ip:
    375175628 total packets received
    0 forwarded
    0 incoming packets discarded
    375162614 incoming packets delivered
    359260474 requests sent out
    5 outgoing packets dropped
Icmp:
    31625 ICMP messages received
    116 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 2382
        echo requests: 29243
    30821 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 1586
        echo replies: 29235
IcmpMsg:
    InType3: 2382
    InType8: 29243
    OutType0: 29235
    OutType3: 1586
Tcp:
    1313017 active connections openings
    748250 passive connection openings
    138137 failed connection attempts
    374700 connection resets received
    21 connections established
    353046858 segments received
    362162781 segments send out
    2510208 segments retransmited
    527276 bad segments received.
    996872 resets sent
Udp:
    7202637 packets received
    249 packets to unknown port received.
    0 packet receive errors
    7207436 packets sent
UdpLite:
TcpExt:
    166485 invalid SYN cookies received
    120761 resets received for embryonic SYN_RECV sockets
    712704 TCP sockets finished time wait in fast timer
    11972772 delayed acks sent
    4321 delayed acks further delayed because of locked socket
    Quick ack mode was activated 290061 times
    4584990 times the listen queue of a socket overflowed
    4584990 SYNs to LISTEN sockets dropped
    3738 packets directly queued to recvmsg prequeue.
    333960446 bytes directly in process context from backlog
    3023346 bytes directly received in process context from prequeue
    108203159 packet headers predicted
    480190 packets header predicted and directly queued to user
    41099609 acknowledgments not containing data payload received
    108090553 predicted acknowledgments
    304 times recovered from packet loss by selective acknowledgements
    56 congestion windows recovered without slow start by DSACK
    3707 congestion windows recovered without slow start after partial ack
    TCPLostRetransmit: 26
    51 timeouts after SACK recovery
    3 timeouts in loss state
    1558 fast retransmits
    295 forward retransmits
    16805 retransmits in slow start
    1672168 other TCP timeouts
    TCPLossProbes: 355970
    TCPLossProbeRecovery: 212066
    16 SACK retransmits failed
    290101 DSACKs sent for old packets
    1 DSACKs sent for out of order packets
    241161 DSACKs received
    1 DSACKs for out of order packets received
    54257 connections reset due to unexpected data
    202077 connections reset due to early user close
    15591 connections aborted due to timeout
    TCPDSACKIgnoredNoUndo: 159022
    TCPSpuriousRTOs: 182
    TCPSackShifted: 643
    TCPSackMerged: 1418
    TCPSackShiftFallback: 6697
    TCPRetransFail: 39
    TCPRcvCoalesce: 44012879
    TCPOFOQueue: 1064
    TCPOFOMerge: 1
    TCPChallengeACK: 536811
    TCPSYNChallenge: 535124
    TCPSpuriousRtxHostQueues: 12215
IpExt:
    InNoRoutes: 2
    InMcastPkts: 739
    InBcastPkts: 14880564
    InOctets: -1302133305
    OutOctets: 1201931353
    InMcastOctets: 26604
    InBcastOctets: 1679428203
    InNoECTPkts: 374870674
    InECT0Pkts: 304957
```

In particular the lines:

```
4584990 times the listen queue of a socket overflowed
4584990 SYNs to LISTEN sockets dropped
```

This suggests that the problem is that the process is not able to accept the network traffic sufficiently fast, and it is being dropped at this point.

On 212:

```
3266737 times the listen queue of a socket overflowed
3266737 SYNs to LISTEN sockets dropped
```

and on 213:

```
736787 times the listen queue of a socket overflowed
736787 SYNs to LISTEN sockets dropped
```

On my rabbitmq servers, which have similar uptimes, I don't see any drops to LISTEN sockets. In fact, a quick grep across the whole of the dev linux estate doesn't show this on anything other than the eventstore servers. I suspect the issue is in the way the network traffic is being dealt with inside the process, rather than the networking itself.

This is listen sockets and the listen queue (i.e. accepting connections). This would not explain TCP connections all dying at the same time.

How long have your ES nodes been running (estimate on the machine)? And what are your uptimes?

BTW, for listen sockets I fixed a bug in mono a couple of weeks ago (which I think will be bundled with the next version of ES) where it would stop accepting all sockets (HTTP) until the process was restarted. Normally I could only make this happen by sending connect floods to the HTTP server. Have you seen, during the runtime of the servers, a time where all HTTP sockets were being refused?

"This is listen sockets and the listen queue (i.e. accepting connections). This would not explain TCP connections all dying at the same time."

But if it is not able to deal with new connections, is it also possible that it is not able to process existing connections expeditiously? Then, after a connection is dropped due to a heartbeat timeout, it would not be able to re-establish it?

"How long have your ES nodes been running (estimate on the machine)? And what are your uptimes?"

```
211# uptime
15:15:19 up 150 days, 4:16, 2 users, load average: 0.81, 0.89, 0.68
eventst+  9556 16.7 13.4 1736832 545916 ? Ssl Mar20 2838:36 /usr/bin/eventstored

212# uptime
16:08:21 up 151 days, 4:55, 1 user, load average: 0.18, 0.31, 0.27
eventst+ 29897 14.4 12.8 1717196 519212 ? Ssl Mar20 2445:11 /usr/bin/eventstored

213# uptime
16:10:32 up 151 days, 4:57, 1 user, load average: 0.51, 0.53, 0.48
eventst+ 25009 15.3 13.0 1713800 530200 ? Ssl Mar31 189:04 /usr/bin/eventstored
```

213 crashed yesterday with these errors:

```
[PID:17377:035 2016.03.31 18:38:09.661 ERROR QueuedHandlerThreadP] Error while processing message EventStore.Core.Messages.SystemMessage+BecomeShuttingDown in queued handler 'Subscriptions'.
System.InvalidOperationException: out of sync
  at System.Collections.Generic.Dictionary`2+Enumerator[System.Guid,EventStore.Core.Services.SubscriptionsService+Subscription].VerifyState () [0x00000] in <filename unknown>:0
  at System.Collections.Generic.Dictionary`2+Enumerator[System.Guid,EventStore.Core.Services.SubscriptionsService+Subscription].MoveNext () [0x00000] in <filename unknown>:0
  at System.Collections.Generic.Dictionary`2+ValueCollection+Enumerator[System.Guid,EventStore.Core.Services.SubscriptionsService+Subscription].MoveNext () [0x00000] in <filename unknown>:0
  at EventStore.Core.Services.SubscriptionsService.Handle (EventStore.Core.Messages.BecomeShuttingDown message) [0x00000] in <filename unknown>:0
  at EventStore.Core.Bus.MessageHandler`1[EventStore.Core.Messages.SystemMessage+BecomeShuttingDown].TryHandle (EventStore.Core.Messaging.Message message) [0x00000] in <filename unknown>:0
  at EventStore.Core.Bus.InMemoryBus.Publish (EventStore.Core.Messaging.Message message) [0x00000] in <filename unknown>:0
  at EventStore.Core.Bus.InMemoryBus.Handle (EventStore.Core.Messaging.Message message) [0x00000] in <filename unknown>:0
  at EventStore.Core.Bus.QueuedHandlerThreadPool.ReadFromQueue (System.Object o) [0x00000] in <filename unknown>:0
```

I assume this is what happens sometimes when the process locks up and we get the alert message:

```
checking gossip at 192.168.147.211:2113
CheckGossip CRITICAL: Could not connect to http://192.168.147.211:2113/gossip?format=xml to check gossip, has event store fallen over on this node?

Host: ggh-d-evtapp01.dev.green.net
Timestamp: 2016-03-18 00:01:29 +0000
Address: 192.168.147.211
Status: CRITICAL
```

One message can be associated with many things. There are a slew of reasons why this could happen. Is it transient or permanent? If it's permanent, it's likely what I am discussing.

TCP and HTTP are different subsystems.

What's interesting is that if you are getting permanent (until node restart) problems on gossip, they are probably fixed by my patch. I was only able to get them to occur by flooding connects, though, and we don't see this behaviour in many, many other configurations, so I am trying to figure out what may make an environment more susceptible to it (I can come up with a few things, but would need more fine-grained testing to verify).

Let me squash my commits on it and I can send over a link to the branch, and binaries built on top of it, if that's acceptable to you? I have a feeling it may resolve the issue.

Greg

When we alert on failing to connect to the gossip port, it either resolves itself relatively quickly, or gets stuck. The process is still there, but no longer responds to gossip (we check via the HTTP connection).

We have often seen this after a period with the time-out-of-sync issue. We normally need to restart the process after this. We developed the monitoring check to try to spot the case where the cluster was still working (because there were still two live nodes) but one node had died. This allows us to restart that node's service without loss of overall service.
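
For context, this is roughly the shape of that check. The sketch below is a hypothetical reconstruction, not our actual monitoring script: the real check hits /gossip?format=xml, while this version parses the JSON form for brevity, and the field names ("members", "isAlive", "state") are assumptions about the gossip payload.

```csharp
// Hypothetical gossip health check: count members reporting isAlive and
// require exactly one master, in the spirit of the CheckGossip alerts above.
using System;
using System.Linq;
using System.Net.Http;
using Newtonsoft.Json.Linq;

class CheckGossip
{
    static int Main(string[] args)
    {
        var node = args.Length > 0 ? args[0] : "192.168.147.211:2113";
        const int expectedNodes = 3;

        using (var http = new HttpClient { Timeout = TimeSpan.FromSeconds(5) })
        {
            Console.WriteLine($"checking gossip at {node}");
            string body;
            try
            {
                body = http.GetStringAsync($"http://{node}/gossip?format=json").Result;
            }
            catch (Exception)
            {
                Console.WriteLine($"CheckGossip CRITICAL: Could not connect to http://{node}/gossip");
                return 2; // CRITICAL exit code, Nagios/Sensu style
            }

            var members = (JArray)JObject.Parse(body)["members"];
            var alive = members.Count(m => (bool)m["isAlive"]);
            var masters = members.Count(m => (string)m["state"] == "Master");

            if (alive < expectedNodes)
            {
                Console.WriteLine($"CheckGossip CRITICAL: Only {alive} alive nodes, should be {expectedNodes} alive");
                return 2;
            }
            if (masters != 1)
            {
                Console.WriteLine($"CheckGossip WARNING: expected exactly 1 master, found {masters}");
                return 1;
            }

            Console.WriteLine("CheckGossip OK");
            return 0;
        }
    }
}
```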

We can have a play with the version in Dev to see if it resolves issues.

"We can have a play with the version in Dev to see if it resolves issues."

It won't, as this is a mono patch, unless we build out packages etc. for it.

Basically, what was happening was that they were breaking their listen->accept->listen async loop, which caused no further connections to get accepted. It was, however, rather hard for me to reproduce, though I can imagine how environments could be more sensitive to it.

This is the patch: https://github.com/gregoryyoung/mono/commit/123aaa5db07fe464d12ea178ecc3f27990cc1a72. Note that the original doesn't handle the finishing-synchronously case, which breaks the loop. This would make total sense for rejected listens (and for HTTP going away until restart; normally TCP would still be working in this case). For transient failures it's likely something else (quite possibly even just packets not getting there, load, etc.).
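
For reference, here is a minimal sketch of the listen->accept->listen loop and the synchronous-completion pitfall being described (an illustration of the pattern only, not the mono code itself; the port and class names are made up):

```csharp
// Socket.AcceptAsync returns false when it completes synchronously, and in
// that case the Completed event is NOT raised. A loop that only re-arms
// itself from the event handler therefore silently stops accepting, and the
// kernel listen queue starts overflowing exactly as in the netstat output.
using System;
using System.Net;
using System.Net.Sockets;

class AcceptLoopSketch
{
    static Socket _listener;

    static void Main()
    {
        _listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        _listener.Bind(new IPEndPoint(IPAddress.Any, 8080)); // arbitrary port
        _listener.Listen(128);
        StartAccept();
        Console.ReadLine();
    }

    static void StartAccept()
    {
        var args = new SocketAsyncEventArgs();
        args.Completed += (sender, e) => ProcessAccept(e);

        // The crucial part: if AcceptAsync finished synchronously it returns
        // false and Completed never fires, so we must continue the loop here.
        if (!_listener.AcceptAsync(args))
            ProcessAccept(args);
    }

    static void ProcessAccept(SocketAsyncEventArgs e)
    {
        if (e.SocketError == SocketError.Success)
        {
            // hand e.AcceptSocket off to connection-handling code here
        }
        // Re-arm the accept loop in every case. (A production loop would
        // iterate rather than recurse, but that detail is omitted here.)
        StartAccept();
    }
}
```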

This change has been accepted onto mono and can now be tested without
using my branch. Make sure you have commit:
https://github.com/mono/mono/commit/196497780f01da0d275fafa1524bc2100459ee27

To build mono locally:

```
git clone http://github.com/mono/mono
cd mono
./autogen.sh
./configure --prefix=$PREFIX   # /opt/mono is a reasonable one
make
sudo make install
```

You can then run binaries from here: https://geteventstore.com/downloads/ (without mono)

```
mono EventStore.ClusterNode.exe (whatever config)
```

Let me know if it helps/doesn't help.

Cheers,

Greg

This patch is being brought into the binary releases (backported) for 3.6.0, which will be released soon.