Sure - should just be a tgz file…
tar -xzf combined_file.tgz
``
should extract it on a Mac
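If you want to sanity-check what's inside before unpacking, something like this should do it (just lists the entries without extracting anything):
tar -tzf combined_file.tgz | head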
Thanks for your assistance…
Oddly enough, when we get this issue, restarting the eventstore process brings the nodes back into sync with no complaints about time. It’s as if the eventstore builds up a backlog of gossip messages, but my understanding was that gossip messages were discarded if they took too long to process.
They are.
I am noticing there are a whole bunch of issues happening around midnight (I've only looked at .212 so far): heartbeat timeouts, election data not being able to be sent, etc.
There is a time difference in the middle of it.
Much of this log is showing pretty severe network issues (cause unknown at this point); as an example, you have run through 57,000 election cycles.
On the gossip, I think you are right that they are being queued. I just audited the code: the timeout removes the task but does not cancel the actual operation, so it could just be very slowly trying (and when the gossips do come through, they come through every few ms).
On 17/3 at 23:46, node 211 generates an error log with VERY SLOW BUS MSG alerts. This is just after:
2016.03.17 23:44:46.952 DEBUG ClusterVNodeControll] There is NO MASTER or MASTER is DEAD according to GOSSIP. Starting new elections.
``
Node 212 had been master.
This was when we got our first monitoring alerts:
checking gossip at 192.168.147.212:2113
Checking for 3 nodes
Checking nodes for IsAlive state
CheckGossip CRITICAL: Only 2 alive nodes, should be 3 alive
Host: ggh-d-evtapp02
Timestamp: 2016-03-17 23:46:29 +0000
Address: 192.168.147.212
``
checking gossip at 192.168.147.213:2113
Checking for 3 nodes
Checking nodes for IsAlive state
CheckGossip CRITICAL: Only 2 alive nodes, should be 3 alive
Host: ggh-d-evtapp03
Timestamp: 2016-03-17 23:46:29 +0000
Address: 192.168.147.213
``
checking gossip at 192.168.147.211:2113
Checking for 3 nodes
Checking nodes for IsAlive state
CheckGossip CRITICAL: Only 1 alive nodes, should be 3 alive
Host: ggh-d-evtapp01
Timestamp: 2016-03-17 23:46:29 +0000
Address: 192.168.147.211
``
So, nodes 213 and 212 see two nodes; node 211 sees one.
Everything recovers at the next check, at 23:47:29.
Next alerts:
nodes found with states:
PreReplica when expected Master or Slave.
checking gossip at 192.168.147.212:2113
Checking for 3 nodes
Checking nodes for IsAlive state
Checking for exactly 1 master
Checking node state
Host: ggh-d-evtapp02
Timestamp: 2016-03-18 00:01:29 +0000
Address: 192.168.147.212
Status: WARNING
``
nodes found with states:
PreReplica when expected Master or Slave.
checking gossip at 192.168.147.213:2113
Checking for 3 nodes
Checking nodes for IsAlive state
Checking for exactly 1 master
Checking node state
Host: ggh-d-evtapp03
Timestamp: 2016-03-18 00:01:29 +0000
Address: 192.168.147.213
Status: WARNING
``
checking gossip at 192.168.147.211:2113
CheckGossip CRITICAL: Could not connect to http://192.168.147.211:2113/gossip?format=xml to check gossip, has event store fallen over on this node?
Host: ggh-d-evtapp01.dev.green.net
Timestamp: 2016-03-18 00:01:29 +0000
Address: 192.168.147.211
Status: CRITICAL
``
Recovers at 00:10:29
Repeats at 00:11:29
Recovers at 00:38:20
Repeats at 01:17:29
Continues flapping thereafter until 20:01, when node 211 no longer recovers and again gets into a time-difference error state. This continues until 20:42, when it reports:
[PID:16535:017 2016.03.18 20:42:14.809 ERROR GossipController ] Received as POST invalid ClusterInfo from [http://192.168.147.211:2113/gossip]. Content-Type: application/json.
[PID:16535:017 2016.03.18 20:42:14.809 ERROR GossipController ] Received as POST invalid ClusterInfo from [http://192.168.147.211:2113/gossip]. Body: .
``
This continues until
[PID:16535:007 2016.03.18 20:47:22.051 ERROR JsonCodec ]
``
During this time, memory use was ballooning, and then
[PID:16535:018 2016.03.18 22:21:00.433 DEBUG MonitoringService ] Could not get free mem on linux, received memory info raw string: []
``
and it was completely fritzed.
So there are two issues here. The first is that the old gossips aren't really cancelling and come through later. This is relatively straightforward to deal with at this point.
https://github.com/EventStore/EventStore/blob/release-v3.6.0/src/EventStore.Transport.Http/Client/HttpAsyncClient.cs#L19
You might enjoy the comment. Most likely it was failing on the create and blocking. There is now a compatible license, so we can use the other code listed. I will put up an issue.
The second issue is what caused everything to get so backed up network-wise. On .212 it looks like all networking is having trouble, in particular with .211: you can see it trying to come back, losing connections, etc. .211 doesn't see anyone else (and is having problems on all forms of networking, TCP and HTTP).
E.g.:
[PID:19643:021 2016.03.18 00:08:29.503 INFO TcpService ] Internal TCP connection accepted: [Normal, 192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}].
[PID:19643:024 2016.03.18 00:08:32.504 TRACE TcpConnectionManager] Closing connection 'internal-normal' [192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 0
[PID:19643:024 2016.03.18 00:08:32.504 INFO TcpConnection ] ES TcpConnection closed [00:08:32.505: N192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Received bytes: 0, Sent bytes: 22
[PID:19643:024 2016.03.18 00:08:32.505 INFO TcpConnection ] ES TcpConnection closed [00:08:32.505: N192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Send calls: 1, callbacks: 1
[PID:19643:024 2016.03.18 00:08:32.505 INFO TcpConnection ] ES TcpConnection closed [00:08:32.505: N192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Receive calls: 1, callbacks: 0
[PID:19643:024 2016.03.18 00:08:32.506 INFO TcpConnection ] ES TcpConnection closed [00:08:32.506: N192.168.147.211:57344, L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 0
On .211:
[PID:16535:026 2016.03.18 00:02:37.202 DEBUG ElectionsService ] ELECTIONS: (V=55714) TIMED OUT! (S=ElectingLeader, M=).
[PID:16535:026 2016.03.18 00:02:37.202 DEBUG ElectionsService ] ELECTIONS: (V=55715) SHIFT TO LEADER ELECTION.
[PID:16535:026 2016.03.18 00:02:37.202 DEBUG ElectionsService ] ELECTIONS: (V=55715) VIEWCHANGE FROM [192.168.147.211:2113, {12c20f0a-fddd-41a9-abf2-25b17eac9ef0}].
[PID:16535:026 2016.03.18 00:02:37.607 INFO ReplicaService ] Subscribing at LogPosition: 4680637529 (0x116FCDC59) to MASTER [192.168.147.212:1111, {ca3cda3c-7619-4b57-9bbe-8151ea9e8082}] as replica with SubscriptionId: {c40c4ee6-bdf3-425d-928c-fc6294b3e139}, ConnectionId: {a92bcebd-2269-47ea-a8ae-bb89a558e3cc}, LocalEndPoint: [], Epochs: E2483@4680637299:{d40f40f4-7b50-4cb6-9c22-c87382698475}
During this time, memory use was ballooning, and then
[PID:16535:018 2016.03.18 22:21:00.433 DEBUG MonitoringService ] Could not get free mem on linux, received memory info raw string: []
This I think might be due to the queueing of those messages; I will audit that code more closely.
What's also interesting is that I have only seen anything like this a few times (only in Azure). I wonder if there is something particular about how the virtualized networks are getting partitioned in these environments as opposed to others.
I'm also going to add monitoring to statistics, as well as a histogram there, to get some deeper information.
We are going to try putting all three nodes as a group so that they live on the same ESX host (and so eliminate the networking issue).
This has the downside of meaning that if that ESX host fails, we lose the whole cluster. However, in Dev at least we can do this to eliminate networking as part of the problem.
The networking issue doesn’t seem to fit, though: we’ve not had any tx or rx errors:
15:15:19 up 150 days, 4:16, 2 users, load average: 0.81, 0.89, 0.68
ggh-d-evtapp01.net.eth0.tx_packets 358948797 1459520042
ggh-d-evtapp01.net.eth0.rx_packets 409045650 1459520042
ggh-d-evtapp01.net.eth0.tx_bytes 117129928809 1459520042
ggh-d-evtapp01.net.eth0.rx_bytes 125511090261 1459520042
ggh-d-evtapp01.net.eth0.tx_errors 0 1459520042
ggh-d-evtapp01.net.eth0.rx_errors 0 1459520042
ggh-d-evtapp01.net.eth0.if_speed 10000 1459520042
``
These are counters, so that means no tx_errors or rx_errors in 150 days. There is no firewall between the nodes.
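For what it's worth, an easy way to re-check those counters directly on a box (assuming the interface really is eth0, as the metric names above suggest) is:
ip -s link show eth0
cat /sys/class/net/eth0/statistics/rx_errors /sys/class/net/eth0/statistics/tx_errors
Both error counters should stay at 0 if the NIC itself isn't the problem.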
Indeed, this seems to be the clincher (for 211):
[root@ggh-d-evtapp01:2016-03-31]# netstat -s
Ip:
375175628 total packets received
0 forwarded
0 incoming packets discarded
375162614 incoming packets delivered
359260474 requests sent out
5 outgoing packets dropped
Icmp:
31625 ICMP messages received
116 input ICMP message failed.
ICMP input histogram:
destination unreachable: 2382
echo requests: 29243
30821 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 1586
echo replies: 29235
IcmpMsg:
InType3: 2382
InType8: 29243
OutType0: 29235
OutType3: 1586
Tcp:
1313017 active connections openings
748250 passive connection openings
138137 failed connection attempts
374700 connection resets received
21 connections established
353046858 segments received
362162781 segments send out
2510208 segments retransmited
527276 bad segments received.
996872 resets sent
Udp:
7202637 packets received
249 packets to unknown port received.
0 packet receive errors
7207436 packets sent
UdpLite:
TcpExt:
166485 invalid SYN cookies received
120761 resets received for embryonic SYN_RECV sockets
712704 TCP sockets finished time wait in fast timer
11972772 delayed acks sent
4321 delayed acks further delayed because of locked socket
Quick ack mode was activated 290061 times
4584990 times the listen queue of a socket overflowed
4584990 SYNs to LISTEN sockets dropped
3738 packets directly queued to recvmsg prequeue.
333960446 bytes directly in process context from backlog
3023346 bytes directly received in process context from prequeue
108203159 packet headers predicted
480190 packets header predicted and directly queued to user
41099609 acknowledgments not containing data payload received
108090553 predicted acknowledgments
304 times recovered from packet loss by selective acknowledgements
56 congestion windows recovered without slow start by DSACK
3707 congestion windows recovered without slow start after partial ack
TCPLostRetransmit: 26
51 timeouts after SACK recovery
3 timeouts in loss state
1558 fast retransmits
295 forward retransmits
16805 retransmits in slow start
1672168 other TCP timeouts
TCPLossProbes: 355970
TCPLossProbeRecovery: 212066
16 SACK retransmits failed
290101 DSACKs sent for old packets
1 DSACKs sent for out of order packets
241161 DSACKs received
1 DSACKs for out of order packets received
54257 connections reset due to unexpected data
202077 connections reset due to early user close
15591 connections aborted due to timeout
TCPDSACKIgnoredNoUndo: 159022
TCPSpuriousRTOs: 182
TCPSackShifted: 643
TCPSackMerged: 1418
TCPSackShiftFallback: 6697
TCPRetransFail: 39
TCPRcvCoalesce: 44012879
TCPOFOQueue: 1064
TCPOFOMerge: 1
TCPChallengeACK: 536811
TCPSYNChallenge: 535124
TCPSpuriousRtxHostQueues: 12215
IpExt:
InNoRoutes: 2
InMcastPkts: 739
InBcastPkts: 14880564
InOctets: -1302133305
OutOctets: 1201931353
InMcastOctets: 26604
InBcastOctets: 1679428203
InNoECTPkts: 374870674
InECT0Pkts: 304957
``
In particular the lines:
4584990 times the listen queue of a socket overflowed
4584990 SYNs to LISTEN sockets dropped
``
This suggests that the problem is that the process is not able to accept new connections fast enough, and they are being dropped at this point.
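For reference, a quick way to watch this live would be something along these lines (the ports are the HTTP and internal TCP ports from the logs above; adjust if yours differ):
# cumulative overflow/drop counters
netstat -s | grep -Ei 'listen queue|SYNs to LISTEN'
# current accept backlog vs. its limit (Recv-Q vs Send-Q on LISTEN sockets)
ss -lnt '( sport = :2113 or sport = :1111 )'
# kernel cap on the accept backlog
sysctl net.core.somaxconn
If Recv-Q keeps hitting the Send-Q value on the listener, the process really isn't accepting connections fast enough.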
On 212:
3266737 times the listen queue of a socket overflowed
3266737 SYNs to LISTEN sockets dropped
``
and on 213:
736787 times the listen queue of a socket overflowed
736787 SYNs to LISTEN sockets dropped
``
On my rabbitmq servers, which have similar uptimes, I don’t see any drops to LISTEN sockets. In fact, a quick grep across the whole of the dev linux estate doesn’t show this on anything other than the eventstore servers. I suspect the issue is in the way the network traffic is being dealt with inside the process, rather than the networking itself.
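Roughly the kind of sweep I mean (hosts.txt here is just a placeholder for the dev host list):
while read h; do
  echo "== $h =="
  ssh "$h" "netstat -s | grep -i 'SYNs to LISTEN sockets dropped'"
done < hosts.txt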
This is listen sockets and listen queue (eg accepting connections).
This would not explain TCP connections all dying at the same time.
How long have your ES nodes been running (estimate on the machine)? And what are your uptimes?
BTW, for listen sockets: I fixed a bug in mono a couple of weeks ago that will, I think, be bundled with the next version of ES, where it would stop accepting all sockets (HTTP) until the process was restarted. Normally I could only make this happen by sending connect floods to the HTTP server. Have you seen, in the runtime of the servers, a time where all HTTP sockets were being refused?
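One way to catch that would be to poll the gossip endpoint and note whether failures are refusals/timeouts rather than slow responses; just a sketch:
while true; do
  date
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 5 \
    http://192.168.147.211:2113/gossip?format=xml || echo "connect failed"
  sleep 30
done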
This is listen sockets and listen queue (eg accepting connections).
This would not explain TCP connections all dying at the same time.
But if it is not able to deal with the new connections, is it also possible that it is not able to process existing connections expeditiously? Then, after a connection is dropped due to a heartbeat timeout, it would not be able to re-establish it?
How long have your ES nodes been running (estimate on the machine)? And what are your uptimes?
211# uptime
15:15:19 up 150 days, 4:16, 2 users, load average: 0.81, 0.89, 0.68
eventst+ 9556 16.7 13.4 1736832 545916 ? Ssl Mar20 2838:36 /usr/bin/eventstored
212# uptime
16:08:21 up 151 days, 4:55, 1 user, load average: 0.18, 0.31, 0.27
eventst+ 29897 14.4 12.8 1717196 519212 ? Ssl Mar20 2445:11 /usr/bin/eventstored
213# uptime
16:10:32 up 151 days, 4:57, 1 user, load average: 0.51, 0.53, 0.48
eventst+ 25009 15.3 13.0 1713800 530200 ? Ssl Mar31 189:04 /usr/bin/eventstored
``
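(For completeness, the process start and elapsed run times can be pulled directly with something like:)
ps -o pid,lstart,etime,cmd -C eventstored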
213 crashed yesterday with these errors:
[PID:17377:035 2016.03.31 18:38:09.661 ERROR QueuedHandlerThreadP] Error while processing message EventStore.Core.Messages.SystemMessage+BecomeShuttingDown in queued handler ‘Subscriptions’.
System.InvalidOperationException: out of sync
at System.Collections.Generic.Dictionary`2+Enumerator[System.Guid,EventStore.Core.Services.SubscriptionsService+Subscription].VerifyState () [0x00000] in :0
at System.Collections.Generic.Dictionary`2+Enumerator[System.Guid,EventStore.Core.Services.SubscriptionsService+Subscription].MoveNext () [0x00000] in :0
at System.Collections.Generic.Dictionary`2+ValueCollection+Enumerator[System.Guid,EventStore.Core.Services.SubscriptionsService+Subscription].MoveNext () [0x00000] in :0
at EventStore.Core.Services.SubscriptionsService.Handle (EventStore.Core.Messages.BecomeShuttingDown message) [0x00000] in :0
at EventStore.Core.Bus.MessageHandler`1[EventStore.Core.Messages.SystemMessage+BecomeShuttingDown].TryHandle (EventStore.Core.Messaging.Message message) [0x00000] in :0
at EventStore.Core.Bus.InMemoryBus.Publish (EventStore.Core.Messaging.Message message) [0x00000] in :0
at EventStore.Core.Bus.InMemoryBus.Handle (EventStore.Core.Messaging.Message message) [0x00000] in :0
at EventStore.Core.Bus.QueuedHandlerThreadPool.ReadFromQueue (System.Object o) [0x00000] in :0
``
I assume this is what happens sometimes when the process locks up, i.e. when we get the alert message:
checking gossip at 192.168.147.211:2113
CheckGossip CRITICAL: Could not connect to http://192.168.147.211:2113/gossip?format=xml to check gossip, has event store fallen over on this node?
Host: ggh-d-evtapp01.dev.green.net
Timestamp: 2016-03-18 00:01:29 +0000
Address: 192.168.147.211
Status: CRITICAL
``
One message can be associated with many things. There are a slew of reasons why this could happen. Is it transient or permanent? If it's permanent, it's likely what I am discussing.
TCP and HTTP are different subsystems.
What's interesting is that if you are getting permanent (until node restart) problems on gossip, they are probably fixed by my patch. I was only able to get them to occur with flooding connects, though, and we don't see this behaviour in many, many other configurations, so I am trying to figure out what may make an environment more susceptible to it (I can come up with a few things but would need more fine-grained testing to verify).
Let me squash my commits on it and I can send over a link to the branch and run binaries on top of it, if that's acceptable for you? I have a feeling it may resolve the issue.
Greg
When we alert on failing to connect to the gossip port, it either resolves itself relatively quickly, or gets stuck. The process is still there, but no longer responds to gossip (we check via the HTTP connection).
We have often seen this after a period with the time-out-of-sync issue, and we normally need to restart the process after this. We developed the monitoring check to try and spot the case where the cluster was still working (because there were still two live nodes) but one node had died. This allows us to restart that node's service without loss of overall service.
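For context, the check is essentially just polling the gossip endpoint and counting alive members; something in the spirit of the following, though the real check does more (element names here are from memory, so treat this as a sketch):
curl -sS --max-time 5 'http://192.168.147.211:2113/gossip?format=xml' \
  | grep -o '<IsAlive>true</IsAlive>' | wc -l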
We can have a play with the version in Dev to see if it resolves issues.
"We can have a play with the version in Dev to see if it resolves issues."
It won't: this is a mono patch, unless we build out packages etc. for it.
Basically, what was happening was they were breaking their listen->accept->listen async loop, which caused no further connections to get accepted. It was rather hard for me to reproduce, though I can imagine how environments could be more sensitive to it.
This is the patch:
https://github.com/gregoryyoung/mono/commit/123aaa5db07fe464d12ea178ecc3f27990cc1a72
Note that the original doesn't handle the case where the operation finishes synchronously, which breaks the loop. This would make total sense for rejected listens (and for HTTP going away until restart; normally TCP would still be working in this case). For transient failures it's likely something else (quite possibly even just packets not getting there, load, etc.).
This change has been accepted into mono and can now be tested without using my branch. Make sure you have this commit:
https://github.com/mono/mono/commit/196497780f01da0d275fafa1524bc2100459ee27
To build mono locally:
git clone http://github.com/mono/mono
cd mono
./autogen.sh
./configure --prefix=$PREFIX   # /opt/mono is a reasonable prefix
make
sudo make install
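Before pointing Event Store at it, it's worth double-checking (from the mono checkout) that the commit is actually in what you built, and that the freshly installed mono is the one being picked up:
git merge-base --is-ancestor 196497780f01da0d275fafa1524bc2100459ee27 HEAD && echo "fix present"
$PREFIX/bin/mono --version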
You can then run binaries from here:
https://geteventstore.com/downloads/ (without mono)
mono EventStore.ClusterNode.exe (whatever config)
Let me know if it helps/doesn't help.
Cheers,
Greg
This patch is being brought into the binary releases (backported) for 3.6.0, which will be released soon.