EventStoreConnection connectivity cycling indefinitely

Hi,
For weeks now, I have been trying to diagnose a rare problem that occurs every few days.

For no apparent reason, most or all of the app's EventStoreConnections close their socket (cleanly) to the EventStore, then immediately and successfully reconnect, only to close the socket again.

Log:

Thu Sep 25 21:43:36.2828 +00:00 2014 MyAppName : Info, ClientAPI TcpConnection closed [21:43:36.282: N10.1.3.4:1113, L10.1.1.4:49251, {63690066-057c-4aa4-ad18-ea7be4706dbe}]: Received bytes: 4744759, Sent bytes: 460351 Send calls: 20921, callbacks: 20921 Receive calls: 21795, callbacks: 21795 Close reason: [Success] Socket closed

Thu Sep 25 21:43:36.2828 +00:00 2014 MyAppName : Trace, TcpPackageConnection: connection [10.1.3.4:1113, L10.1.1.4:49251, {63690066-057c-4aa4-ad18-ea7be4706dbe}] was closed cleanly.

Thu Sep 25 21:43:36.2985 +00:00 2014 MyAppName : Info, EventSubscriber for Stream “MyStream”: EventStoreConnection disconnected

Thu Sep 25 21:43:36.2985 +00:00 2014 MyAppName : Warn, EventSubscriber “MyStream”: Subscription stopped, Reason: ConnectionClosed, Exception: EventStore.ClientAPI.Exceptions.ConnectionClosedException: Connection was closed.

Thu Sep 25 21:43:36.5172 +00:00 2014 MyAppName : Info, EventSubscriber for Stream "MyStream": EventStoreConnection reconnecting

Thu Sep 25 21:43:36.5172 +00:00 2014 MyAppName : Trace, TcpPackageConnection: connected to [10.1.3.4:1113, L10.1.1.4:49823, {9929ffd3-ee4e-4434-b48c-0f1d62718f7b}].

Thu Sep 25 21:43:36.5172 +00:00 2014 MyAppName : Info, EventSubscriber for Stream “MyStream”: EventStoreConnection connected

Thu Sep 25 21:43:39.5251 +00:00 2014 MyAppName : Info, ClientAPI TcpConnection closed [21:43:39.525: N10.1.3.4:1113, L10.1.1.4:49824, {ad5ce277-b856-4b76-847c-67293114c064}]: Received bytes: 22, Sent bytes: 0 Send calls: 0, callbacks: 0 Receive calls: 2, callbacks: 2 Close reason: [Success] Socket closed

Thu Sep 25 21:43:39.5251 +00:00 2014 MyAppName : Trace, TcpPackageConnection: connection [10.1.3.4:1113, L10.1.1.4:49824, {ad5ce277-b856-4b76-847c-67293114c064}] was closed cleanly.

Thu Sep 25 21:43:40.2907 +00:00 2014 MyAppName : Info, EventSubscriber for Stream “MyStream”: EventStoreConnection disconnected

Thu Sep 25 21:43:41.2935 +00:00 2014 MyAppName : Info, EventProcessor “MyStream”: EventStoreConnection reconnecting

Thu Sep 25 21:43:41.2935 +00:00 2014 MyAppName : Trace, TcpPackageConnection: connected to [10.1.3.4:1113, L10.1.1.4:49831, {5e9c3ca2-8da2-40be-8479-19ec1b0669b6}].

Thu Sep 25 21:43:41.2935 +00:00 2014 MyAppName : Info, EventSubscriber for Stream “MyStream”: EventStoreConnection connected

Thu Sep 25 21:43:43.7334 +00:00 2014 MyAppName : Info, EventStoreConnection ‘ES-4a751cfb-8a13-4b72-a293-fc4ff884fc5b’: closing TCP connection [10.1.3.4:1113, L10.1.1.4:49831, {5e9c3ca2-8da2-40be-8479-19ec1b0669b6}] due to HEARTBEAT TIMEOUT at pkgNum 21585.


This continues endlessly. It never starts with heartbeat timeouts, but they always appear after at least one or two reconnect cycles.

The network connectivity of the (Azure-based) VM on which the EventStore resides is limited to 5 MBit/s, which is usually more than enough because network usage never goes beyond a fraction of that. But as soon as this cycling begins, Network Out of the EventStore VM sits at the 5 MBit/s maximum. I'd guess that is because of the overhead of the endlessly reconnecting sockets, which would also explain the heartbeat timeouts.

I can't find any reason for this: no log entries that explain what could lead to it and no logged exceptions. My best guess is that there is a small bug in the reconnection logic and the first disconnect is just a normal network hiccup (as is to be expected inside Azure), but somehow the reconnect misbehaves after that. Obviously that's just a wild theory.

Restarting the VM that hosts the EventStore-accessing application solves the problem: the endless reconnections stop, network usage of the EventStore VM goes down to a normal (low) level, and after the application is restarted everything works fine again.

The EventStore ConnectionSettings are set with KeepReconnecting and KeepRetrying as well as a custom logger.
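For reference, the setup looks roughly like this (a sketch only: the endpoint is taken from the log above, MyCustomLogger stands in for my logger implementation, and the exact builder calls may differ slightly between client versions):

```csharp
using System.Net;
using EventStore.ClientAPI;

// Sketch of the connection setup: never stop reconnecting or retrying,
// and route all client log output through our own logger.
var settings = ConnectionSettings.Create()
    .UseCustomLogger(new MyCustomLogger())   // placeholder ILogger implementation
    .KeepReconnecting()                      // reconnect forever instead of giving up
    .KeepRetrying();                         // retry queued operations forever

var connection = EventStoreConnection.Create(
    settings,
    new IPEndPoint(IPAddress.Parse("10.1.3.4"), 1113));  // endpoint from the log above

connection.ConnectAsync().Wait();  // older client versions expose Connect() instead
```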

Regards,
Nicolas

Sounds almost exactly like the problems I had; they were caused by low memory (https://groups.google.com/forum/#!searchin/event-store/heartbeat/event-store/LIC7XUY6rDM/DU8-Nq40VggJ).

/Peter

Do you maybe have a huge event in your stream?

Thank you two for the fast responses.

@Peter Hageus

You're right, that does sound awfully similar. But while my relatively small Azure VM does have very limited RAM, its usage did not show much difference over the days before and after the last time this happened.

And it's impossible to use larger VMs for this role, at least long-term, and short-term it wouldn't help much :confused:

But I'll keep an eye on it and will monitor RAM usage more closely the next time this happens.

@Greg

Not that I know of; there is some variety in the size of the events, but nothing extreme. Additionally, wouldn't that be problematic even after the restart?

The problem occurred again yesterday, and memory usage never went higher than 87% of the 768 MB, so I guess we can rule out memory usage.

@Greg

Since the heartbeat timeouts and reconnections won't stop even after 24 hours, and I see no way my code could be responsible for them (no obvious deadlocks, no exceptions, no disconnect/reconnect invocations from my side), this is currently a showstopper for me. Is there a plan to diagnose this further? How can I help?

Regards

If the bandwidth usage were caused by the reconnects, you would be seeing thousands of connections per second. As you can see in your log, not much data at all goes over the pipe on a reconnect; only a few bytes are exchanged before it gets disconnected. Could something else be using the bandwidth?

Well, that does make sense, but I have no other explanation. Only when I shut down or restart the client instance (which shows a constant 5 MBit/s of "receiving" during this cycling) does the 5 MBit/s of "sending" on the EventStore VM drop to a few KB/s.
I am not sure how the cycling can fill all 5 MBit/s, but since it can't really be anything other than GES, it must be the reconnection traffic.

Would server-side log extracts and screenshots of Task Manager and Resource Monitor help?

Regards

Try enabling verbose client logging; it tells us exactly what the client is doing (via your log interface).
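For anyone following along, enabling it is one extra call on the settings builder, and the verbose output then flows through the custom logger. A minimal sketch, assuming the EventStore.ClientAPI.ILogger interface of that client generation (the Console sink is only a placeholder for a real log target):

```csharp
using System;
using EventStore.ClientAPI;

// Verbose logging is switched on via the settings builder, e.g.:
//   ConnectionSettings.Create()
//       .UseCustomLogger(new ForwardingLogger())
//       .EnableVerboseLogging()
//       .KeepReconnecting()
//       .KeepRetrying();
//
// The logger below simply forwards everything the client reports.
public class ForwardingLogger : ILogger
{
    public void Debug(string format, params object[] args)               { Write("DEBUG", format, args); }
    public void Debug(Exception ex, string format, params object[] args) { Write("DEBUG", format, args); Write("DEBUG", "{0}", ex); }
    public void Info(string format, params object[] args)                { Write("INFO ", format, args); }
    public void Info(Exception ex, string format, params object[] args)  { Write("INFO ", format, args); Write("INFO ", "{0}", ex); }
    public void Error(string format, params object[] args)               { Write("ERROR", format, args); }
    public void Error(Exception ex, string format, params object[] args) { Write("ERROR", format, args); Write("ERROR", "{0}", ex); }

    private static void Write(string level, string format, params object[] args)
    {
        // Placeholder sink: replace Console.WriteLine with the application's own logging.
        Console.WriteLine("{0:HH:mm:ss.fff} {1} {2}", DateTime.UtcNow, level, string.Format(format, args));
    }
}
```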

Okay, thanks, I will do that with the next deployment. Unfortunately that means we'll have to wait for the next occurrence (~ a week) for results.

The logs you have basically show the connection closing with only a few messages being sent across, so it will be interesting to see the client side of it. How often is it trying to retry, btw? (It should be in the logs.)

Just out of curiosity, what happens if you restart the client?

No “obvious” deadlocks is NOT the same as no deadlocks, especially if you’re using the magic underneath async and await. I’ve hit this repeatedly, though there are some changes afoot in the client which may make it more difficult in future. However, in this case it sounds like messages are being pumped.

Can you put a packet analyzer between the two VMs and post a trace, so we can tell what we're looking at?

@Greg
If I restart the client, everything goes back to normal. In my opinion, the client is somehow hammering the server until it is forced to stop.

@James

True, though when I went through the code looking for anything that could be related to this, I did not find anything.

Unfortunately since I added the verbose logging, I had to restart the client (and thus temporarily delayed the problem), but as soon as it occurs again I will try to analyze the traffic further.

An update on this after review of logs.

The client is reconnecting as it should. It then executes a read-events-forward of 500 events against the stream, and that connection then times out: during the roughly 1-2 seconds the operation is running, it pulls down more than 500 KB worth of data before timing out with an application-level heartbeat timeout.

The solution should be one of the following (a sketch follows the list):

  1. increase timeout

  2. use smaller batch sizes on the reading
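A rough sketch of both options (the values are illustrative, `connection` is the IEventStoreConnection from the setup sketch earlier in the thread, and builder method names may vary slightly between client versions):

```csharp
using System;
using EventStore.ClientAPI;

// Option 1: give heartbeats more headroom so a long-running read does not
// take the whole connection down.
var settings = ConnectionSettings.Create()
    .SetHeartbeatInterval(TimeSpan.FromMilliseconds(1500))
    .SetHeartbeatTimeout(TimeSpan.FromMilliseconds(3000))
    .KeepReconnecting()
    .KeepRetrying();

// Option 2: read in smaller batches so a single read cannot hog the 5 MBit/s
// link long enough for heartbeats to time out (50 instead of the default 500).
var slice = connection.ReadStreamEventsForwardAsync("MyStream", 0, 50, false).Result;
```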

Cheers,

Greg

Thank you so much!

My Repository already had a maxBatchSize of 100, but I used the default values for the subscriptions, thus the 500. I changed that so that all read accesses now have a maxBatchSize of 50 (for now).
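For what it's worth, in the client version from around that time the read batch size appears to be the last parameter of SubscribeToStreamFrom (defaulting to 500). A rough sketch of passing 50 instead, with HandleEvent and Log as placeholder handlers:

```csharp
// Catch-up subscription with a smaller read batch size (sketch; HandleEvent and
// Log stand in for the application's own handlers).
var subscription = connection.SubscribeToStreamFrom(
    "MyStream",
    null,                                                // lastCheckpoint: start from the beginning
    false,                                               // resolveLinkTos
    (sub, resolvedEvent) => HandleEvent(resolvedEvent),  // eventAppeared
    sub => Log("live processing started"),               // liveProcessingStarted
    (sub, reason, ex) => Log("subscription dropped: " + reason),  // subscriptionDropped
    null,                                                // userCredentials
    50);                                                 // readBatchSize (the default was 500)
```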

HeartbeatInterval and HeartbeatTimeout were already slightly increased: the interval from 750 to 1200 ms and the timeout from 1500 to 2400 ms. Should I increase these values further?

I’ll deploy the changes this evening, hopefully the connection cycles are resolved with them.

Thanks,

Nicolas

The timeouts really depend on your particular environment. There is a tradeoff between finding a possibly dead node quickly and possibly timing out an operation that shouldn't be timed out. What the right value is depends on your traffic, your networking setup, server/client speeds, etc. 1200 ms is likely reasonable though.