Connection disconnects spontaneously

Hi,

This issue was partially raised in https://groups.google.com/forum/#!topic/event-store/X6AaqXzwLj0 and might be connected with it, but I think it would be too confusing to discuss them in the same thread.

In our (Azure-hosted) application, EventStore connections seem to disconnect for no apparent reason. It’s very hard to see a pattern here, but as far as I can tell, the time between disconnects ranges from a few seconds to 15 minutes, and sometimes only a few connections drop while at other times all of them disconnect simultaneously.

It does not seem to happen more or less often under load, but it’s hard to tell. The client-side heartbeat timeout is experimentally set to 3 s.
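For reference, the connections are currently built roughly like this (a minimal sketch, not our exact production code; the endpoint, the connection name and everything other than the 3 s heartbeat timeout are placeholders):

```csharp
using System;
using System.Net;
using EventStore.ClientAPI;

static class EsConnectionFactory
{
    public static IEventStoreConnection Create()
    {
        ConnectionSettings settings = ConnectionSettings.Create()
            .SetHeartbeatTimeout(TimeSpan.FromMilliseconds(3000)) // raised from the 1.5 s default
            .KeepReconnecting()                                   // never give up reconnecting (assumption)
            .EnableVerboseLogging()                               // so the disconnects show up in the client log
            .UseConsoleLogger();

        var connection = EventStoreConnection.Create(
            settings,
            new IPEndPoint(IPAddress.Parse("10.2.4.4"), 1113),    // placeholder endpoint (taken from the log excerpt)
            "ES-example-connection");                             // placeholder connection name

        connection.ConnectAsync().Wait();
        return connection;
    }
}
```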

Server log: https://dl.dropboxusercontent.com/u/19676954/reconnect-diagnose/10.2.4.4-2113-cluster-node.log_2015-03-06_133645_2015-03-06_135645.log

Client log: https://dl.dropboxusercontent.com/u/19676954/reconnect-diagnose/RedactedClientLog.txt

The client log extract covers the same time period as the third block in the server log.

A few things that I have already tried (I can’t remember all of them):

  • Playing around with the client-side heartbeat timeout and heartbeat interval

  • Using bigger Azure VMs (ranging from ExtraSmall to D1, 3.5 GB RAM)

  • Using multiple EventStoreConnection instances as well as only a single one

  • Using multiple client versions, including the most current stable release

  • Creating a completely new hosting Azure VM

To be honest, I don’t know what else I could try anymore, so any guidance would be very much appreciated.

Regards,

Nicolas

Would it be more likely that they are actually losing network
connectivity in a shared environment like the cloud? Especially if all
your client connections happen to go down at the same time and only
occasionally?

There is a fix for a possible reconnect never working, but I don't think that's related.

E.g.: here they disconnect, reconnect and keep working. I would guess this was just a networking error (they do occasionally happen):

Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, EventStoreConnection 'ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1':
TimerTick checking reconnection....
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Warn, VolatileProjectionModule "context1": EventStoreConnection
reconnecting, connection name:
"ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Warn, VolatileProjectionModule "context2": EventStoreConnection
reconnecting, connection name:
"ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Warn, VolatileProjectionModule "context6": EventStoreConnection
reconnecting, connection name:
"ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Warn, VolatileProjectionModule "context7": EventStoreConnection
reconnecting, connection name:
"ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Warn, PersistentProjectionModule "context3": EventStoreConnection
reconnecting, connection name:
"ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Warn, PersistentProjectionModule "context5": EventStoreConnection
reconnecting, connection name:
"ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Warn, PersistentProjectionModule "context8": EventStoreConnection
reconnecting, connection name:
"ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Warn, PersistentProjectionModule "context4": EventStoreConnection
reconnecting, connection name:
"ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, EventStoreConnection 'ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1':
DiscoverEndPoint.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, EventStoreConnection 'ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1':
enqueueing message
EventStore.ClientAPI.Core.EstablishTcpConnectionMessage..
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, EventStoreConnection 'ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1':
EstablishTcpConnection to [10.2.4.4:1113].
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, TcpPackageConnection: connected to [10.2.4.4:1113,
L10.2.2.4:49244, {f9d9eec1-69e3-4980-85b5-a65f083f7ec3}].
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, EventStoreConnection 'ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1':
enqueueing message
EventStore.ClientAPI.Core.TcpConnectionEstablishedMessage..
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, EventStoreConnection 'ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1':
TCP connection to [10.2.4.4:1113, L10.2.2.4:49244,
{f9d9eec1-69e3-4980-85b5-a65f083f7ec3}] established..
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Info, VolatileProjectionModule "context1": EventStoreConnection
connected, connection name: "ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Info, VolatileProjectionModule "context2": EventStoreConnection
connected, connection name: "ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Info, VolatileProjectionModule "context6": EventStoreConnection
connected, connection name: "ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Info, VolatileProjectionModule "context7": EventStoreConnection
connected, connection name: "ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Info, PersistentProjectionModule "context3": EventStoreConnection
connected, connection name: "ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Info, PersistentProjectionModule "context5": EventStoreConnection
connected, connection name: "ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Info, PersistentProjectionModule "context8": EventStoreConnection
connected, connection name: "ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Info, PersistentProjectionModule "context4": EventStoreConnection
connected, connection name: "ES-abb8e0f1-3c04-4931-a6f0-c81a40bac6b1"
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: recovering after reconnection.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: unhooking from
connection.Connected.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: recovering after reconnection.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: unhooking from
connection.Connected.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: recovering after reconnection.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: unhooking from
connection.Connected.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: recovering after reconnection.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: unhooking from
connection.Connected.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: recovering after reconnection.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: unhooking from
connection.Connected.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: recovering after reconnection.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: unhooking from
connection.Connected.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: recovering after reconnection.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: unhooking from
connection.Connected.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: recovering after reconnection.
Fri Mar 06 13:45:57.7446 +00:00 2015 ProjectionHost (Hostname1) :
Trace, Catch-up Subscription to {0}: unhooking from
connection.Connected.

Normally that would be a reasonable explanation for the cloud, but I really don’t think that is the case here.

Other connections (constant ping, open Redis connections, etc.) don’t seem to be affected at all, and since these EventStore disconnections have been logged a few hundred times a day (for about three months now), network disruptions would have been noticed through the other network usage as well.

It looks like either something in the ClientAPI just decides to end the connection (for some reason), or this specific usage combination kills the connection.

For example, too large a batch size can lead to connection disruptions (we had this issue in this newsgroup a few days ago); maybe something like that is happening here. (Though not this exact cause, because I have already limited the batch size.)
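(To clarify what I mean by limiting the batch size: instead of reading a stream in one huge batch, I read it in small pages, roughly like this sketch. The page size of 100 and the helper itself are just illustrations, not our actual code.)

```csharp
using EventStore.ClientAPI;

static class PagedReader
{
    // Sketch: read a stream forwards in small pages instead of one large batch.
    public static void ReadAll(IEventStoreConnection connection, string stream)
    {
        const int pageSize = 100; // deliberately small read batch
        var nextEventNumber = 0;
        StreamEventsSlice slice;
        do
        {
            slice = connection.ReadStreamEventsForwardAsync(stream, nextEventNumber, pageSize, true).Result;
            foreach (var resolvedEvent in slice.Events)
            {
                // ... handle resolvedEvent ...
            }
            nextEventNumber = slice.NextEventNumber;
        } while (!slice.IsEndOfStream);
    }
}
```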

"For example, too big a batch size leads to connection disruptions (we
had this issue a few days ago in this newsgroup), maybe something like
that happens here. (Though not this exact reason because I limited the
batch size already)."

This can happen, though there is a change in the pipeline to make that not happen any more.

Looking at your log, it looks like we handled an OS connection close.

Wait, the server side shows a heartbeat timeout...

You said you have 3-second client timeouts... What do you have set for the server timeouts?

I did not change the server settings for the TCP heartbeat, only the client side … does one need to change both? (Thinking about it, that would make sense.)

On the client side, I have the heartbeat timeout set to 3 s (the default is 1.5 s).

Would I need to change IntTcpHeartbeatTimeout or ExtTcpHeartbeatTimeout on the server side?

For clients it's Ext; the other one is for internal cluster communication.
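So the client heartbeat timeout should be kept in line with the server's external TCP heartbeat timeout, roughly like this (a minimal sketch; the -ExtTcpHeartbeatTimeout flag is as you'd pass it on the server command line, and the heartbeat interval value here is just an assumption):

```csharp
using System;
using EventStore.ClientAPI;

static class HeartbeatConfig
{
    // Server side (command line):  -ExtTcpHeartbeatTimeout 3000
    // Client side, matching that 3000 ms timeout:
    public static ConnectionSettings Build()
    {
        ConnectionSettings settings = ConnectionSettings.Create()
            .SetHeartbeatInterval(TimeSpan.FromMilliseconds(750)) // assumed interval, well below the timeout
            .SetHeartbeatTimeout(TimeSpan.FromMilliseconds(3000));
        return settings;
    }
}
```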

Looks good for now, no new warnings yet. I’ll monitor it for a while and report back.

Can you let us know? I have spent the last couple of hours reviewing the heartbeat code; it seems reasonable and should not be causing random drops.

If you think you are still seeing random drops, can we grab some logs of them?

Sorry for the late response, I wasn’t able to take a look over the weekend.

Unfortunately, over the course of three days (since Friday), logging shows approximately 100 disconnects and 72 entries like this:

Error, EventStoreConnection ‘ES-a802af33-3245-44ba-bc54-2583285c89b6’: operation never got response from server. UTC now: 06:34:45.017, operation: Operation ReadStreamEventsBackwardOperation ({9344dfe1-05f8-4f73-83b2-e246ea7eba9d}): Stream: context-mycontext, FromEventNumber: -1, MaxCount: 1, ResolveLinkTos: True, RequireMaster: True, retry count: 1, created: 06:34:35.516, last updated: 06:34:37.626


This happened in two entirely independent Azure environments, so it can’t be due to a faulty VM, etc.

A few of my servers have relatively high RAM usage (95% of 768 MiB), and I think I remember reading in this group that high RAM usage can lead to heartbeat timeouts, but hopefully not in this fashion. Additionally, the VMs don’t really need more RAM, as the current data sets are very small.

Is there anything I can do to get more information about those timeouts? I already have verbose logging turned on, but I am not able to read much out of it.

What more would you like to know about the timeouts other than that they timed out? In terms of checking it out, there is one thing that might help: if you install the dev build and go to /jitter, you will get a histogram of jitter (which could be caused by loads of things).

At 95% RAM, are you paging, etc.? Have you taken steps to lower memory usage? I am guessing this is Windows, as it is normally the one with problems. http://www.uwe-sieber.de/ntcacheset_e.html can help lower the amount of memory given to the file cache (it will eat all memory by default). You can also get around this by using --unbuffered on the dev build.

Also, set -c 1 on the command line.

What are your timeouts, by the way?

Error, EventStoreConnection 'ES-a802af33-3245-44ba-bc54-2583285c89b6':
operation never got response from server. UTC now: 06:34:45.017,
operation: Operation ReadStreamEventsBackwardOperation
({9344dfe1-05f8-4f73-83b2-e246ea7eba9d}): Stream: context-mycontext,
FromEventNumber: -1, MaxCount: 1, ResolveLinkTos: True, RequireMaster:
True, retry count: 1, created: 06:34:35.516, last updated:
06:34:37.626

That is not a heartbeat timeout, it's an operation timeout, and it will be retried. On Azure this does not surprise me in the slightest (we have at times seen reads from blob storage take > 10 seconds for, say, 4096 bytes).

Cheers,

Greg

Sorry, /histogram/jitter.

Honestly, I don’t really know … since I don’t know why the timeouts are happening, I am just trying to gather as much information as possible.

Did you mean http://[myIP]:2113/web/index.html#/histogram/jitter? That URL doesn’t work on the dev build, and I couldn’t find via Google what other URL it might be.

Yes, all VMs are Windows-based. RAM usage on the EventStore host is no problem (at least 2 GiB is free at all times); the 95% estimate mentioned above was only for the application VMs that access the EventStore. I don’t think the Windows file cache is a factor here, because those application VMs don’t really work with files.

Timeouts are set to 3 s:

Server: -ExtTcpHeartbeatTimeout 3000

Client: […].SetHeartbeatTimeout(TimeSpan.FromMilliseconds(3000))

So are you saying that those operation timeout log entries can be ignored? Since I set the client to “[…].KeepRetrying()”, is the log entry purely informational?

Thanks,

Nicolas

"So are you saying that those operation timeout log entries can be
ignored? As I set the client to "[...].KeepRetrying()", the log entry
was purely informational?"

Yes, the client has an operation timeout as well; the operation simply timed out when requested. That can happen sometimes, especially in the cloud.

3 seconds is quite a bit. Which side is timing out the heartbeat, the client or the server?

Indeed, 3 seconds is a lot, but I had hoped it would be enough to limit those timeouts.

How do I find out which side initiated the timeout?

Check which log has the close due to heartbeat timeout in it.

Looks to me like it was server-initiated.

…TRACE TcpConnectionManager] Closing connection ‘external-normal’ [X.X.2.4:52532, LX.X.4.4:1113, {3a56582c-3672-46aa-bc4a-ac4fd9999ffa}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 568


One second after that, the client-side lines about disconnecting, reconnecting and connecting appear.
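To line these up with the server log, I also hook the client connection events and log them with timestamps, roughly like this (a sketch; the event and property names are how I understand the ClientAPI, so treat them as assumptions):

```csharp
using System;
using EventStore.ClientAPI;

static class ConnectionDiagnostics
{
    // Sketch: log every connection state change so it can be correlated
    // with the server-side TcpConnectionManager entries.
    public static void Attach(IEventStoreConnection connection)
    {
        connection.Disconnected += (s, e) =>
            Console.WriteLine("{0:O} disconnected from {1}", DateTime.UtcNow, e.RemoteEndPoint);
        connection.Reconnecting += (s, e) =>
            Console.WriteLine("{0:O} reconnecting", DateTime.UtcNow);
        connection.Connected += (s, e) =>
            Console.WriteLine("{0:O} connected to {1}", DateTime.UtcNow, e.RemoteEndPoint);
        connection.Closed += (s, e) =>
            Console.WriteLine("{0:O} closed, reason: {1}", DateTime.UtcNow, e.Reason);
        connection.ErrorOccurred += (s, e) =>
            Console.WriteLine("{0:O} error: {1}", DateTime.UtcNow, e.Exception);
    }
}
```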

What did that tell us?

Are they all server-closed, or are some client-closed? You said you had about 100.

Especially in the cloud I might expect some timeouts to be hit, but 50 a day is a bit much. Is your connection between two nodes in the cloud, or between a node in the cloud and, say, a node running locally?