Keep losing connection to EventStore and cannot reconnect unless application is restarted

So this is something we’ve had problems with almost constantly in the past. We’ve had a workaround, but I’m back to trying to fix the actual problem.

  • Randomly, the client will disconnect from the server. I log from the ErrorOccurred event handler, but it’s never hit.

  • I can see in the info logs that the connection was closed cleanly and that the client is attempting to reconnect, but the reconnection fails because it is refused:

    ```
    I [00:04:53 INF] Received bytes: 232929, Sent bytes: 219783
    I [00:04:53 INF] Send calls: 9755, callbacks: 9755
    I [00:04:53 INF] Receive calls: 9751, callbacks: 9751
    I [00:04:53 INF] Close reason: [Success] Socket closed
    I [00:04:53 DBG] TcpPackageConnection: connection [xx.xx.xx.52:1113, L10.44.3.4:45499, {9e16ea9e-35b9-495d-b748-632e4a6920f5}] was closed cleanly.
    I [00:04:54 DBG] TcpPackageConnection: connection to [xx.xx.xx.52:1113, Lnull, {7a32e884-8b5b-4e59-92b7-8301275edc94}] failed. Error: ConnectionRefused.
    ```

    … followed by 9 more messages like this, and then it gives up.

  • The close reason makes it sound like the close was deliberate instead of in response to something like a timeout – is this indicative of something or is this just a log that needs to be improved?

  • The application keeps running through this, so the next request to ES gets a “cannot access disposed object” error, and it stays broken until the next restart.

  • I’m not sure why the reconnection gets refused, but what I do know is that if I completely restart the application, it connects again with no problem. So our workaround has been to handle the Closed event by calling Environment.FailFast and letting Kubernetes restart the pod; when it comes back up, ES connects instantly (see the sketch after this list).
    I’m probably talking out of my ass here, but this is what feels like it’s happening to produce this behavior:

  • The client or server hasn’t heard from the other in a while so it sends out a heartbeat request.

  • That heartbeat doesn’t make it back before the timeout for one of a dozen reasons (the server is running slow, network latency in the cloud, etc.), and the client (in this case) closes the connection and then tries to reconnect.

  • Despite the “disconnected cleanly” status, the server hasn’t yet released the connection object with that id – the client tries to reconnect with the same id and the server rejects it.

  • The client keeps trying this and gives up after 10 tries

  • The application restarts and tries to open another connection to the ES server, but now it has a new tcp connection id so it connects right away and is good until the next incident.
    Could there be any merit to this? It’s hard to test this because it happens so randomly.
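For reference, here’s roughly how the connection and its event handlers (including the FailFast workaround) are wired up. This is a simplified sketch; the endpoint and connection name are placeholders rather than our exact code:

```csharp
// Simplified sketch of the connection wiring; endpoint, connection name,
// and log calls are placeholders, not the exact production code.
using System;
using System.Net;
using EventStore.ClientAPI;

var settings = ConnectionSettings.Create().Build();

var conn = EventStoreConnection.Create(
    settings,
    new IPEndPoint(IPAddress.Parse("10.0.0.52"), 1113), // placeholder for the node address
    "my-service");

// This handler is never hit when the drop happens.
conn.ErrorOccurred += (s, e) =>
    Console.WriteLine($"ES error: {e.Exception}");

conn.Disconnected += (s, e) =>
    Console.WriteLine($"ES disconnected from {e.RemoteEndPoint}");

conn.Reconnecting += (s, e) =>
    Console.WriteLine("ES reconnecting...");

// The workaround: once the client gives up and raises Closed, fail fast
// and let Kubernetes restart the pod; on startup ES connects instantly.
conn.Closed += (s, e) =>
    Environment.FailFast($"EventStore connection closed: {e.Reason}");

await conn.ConnectAsync();
```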

I just increased the heartbeat interval to 5000 ms and the heartbeat timeout to 5000 ms as well, and everything is running fine at the moment. I’m usually asked to provide logs for this (this is about the third time around this block), but the back and forth on logs hasn’t been very productive, so I’ll still share logs; maybe I could also be educated on what I’m looking for?
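For clarity, the change was made through the ClientAPI connection settings builder, roughly like this (a sketch; the reconnection-limit line is only there because, as far as I can tell, 10 is the default, which matches the ~10 attempts in the logs before it gives up):

```csharp
// Sketch of the settings change: heartbeat interval and timeout raised
// to 5000 ms via the ClientAPI connection settings builder.
using System;
using EventStore.ClientAPI;

var settings = ConnectionSettings.Create()
    .SetHeartbeatInterval(TimeSpan.FromMilliseconds(5000))
    .SetHeartbeatTimeout(TimeSpan.FromMilliseconds(5000))
    // 10 appears to be the default reconnection limit, matching the
    // roughly 10 failed attempts in the logs before the client gives up.
    .LimitReconnectionsTo(10)
    .Build();
```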

  • What are some things that might cause this behavior?
  • What am I looking for in the logs that could confirm these likely suspects? Please be clear on which logs I’m looking at in each step, client or server.

Event Store server version 4.1.1.0 (3 nodes, each running on a dedicated Azure D2_v2 VM)

Event Store client version 4.2.0.1 (in a service pod running in GKE)

Thanks!

  • The client or server hasn’t heard from the other in a while so it sends out a heartbeat request.
    This is legit.

  • That heartbeat doesn’t make it back before the timeout for one of a dozen reasons (the server is running slow, network latency in the cloud, etc.), and the client (in this case) closes the connection and then tries to reconnect.
    This is legit.

  • Despite the “disconnected cleanly” status, the server hasn’t yet released the connection object with that id – the client tries to reconnect with the same id and the server rejects it.
    There is no id. There is no logic that would reject the connection unless something quite odd is going on at the OS (or similar) level.

  • The client keeps trying this and gives up after 10 tries
    This is legit

  • The application restarts and tries to open another connection to the ES server, but now it has a new tcp connection id so it connects right away and is good until the next incident.
    Not sure on this one, but I doubt it. There isn’t a “tcp connection id” being used above.

    Beyond this, you are also getting an active connection refused in your log. This comes from the TCP socket, or from something in the middle (I’m not sure what routing, etc. exists between your client and the node), and it happens prior to interacting with anything ES-based (the socket must be established before anything in the protocol occurs).
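If it helps to narrow it down, a raw socket test run from the same pod when the problem occurs would confirm whether the refusal is at the TCP level; a minimal sketch, with the host and port as placeholders for the node being dialed:

```csharp
// Minimal check that "ConnectionRefused" comes from the TCP layer itself,
// before any EventStore protocol traffic. Host and port are placeholders.
using System;
using System.Net.Sockets;

try
{
    using var tcp = new TcpClient();
    await tcp.ConnectAsync("node-address-here", 1113);
    Console.WriteLine("Raw TCP connect succeeded.");
}
catch (SocketException ex) when (ex.SocketErrorCode == SocketError.ConnectionRefused)
{
    // Same failure the client logs; nothing ES-based has been reached yet.
    Console.WriteLine("Connection actively refused at the socket level.");
}
```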

There is no id. There is no logic that would reject the connection unless something quite odd is going on at the OS (or similar) level.

Kind of what I’m getting at. I wasn’t referring to the “id” as an EventStore concept. It was always my understanding that .NET assigns an id to a TCP socket. I ran into this before when writing some HSM software: because the id was tracked in a private static field inside the TCP client, if I needed to run two jobs concurrently I had to run them in separate app domains. It was silly. Anyway, I thought that could possibly be related, since the ES connection is being used as a singleton. I thought that maybe if the connection was dropped, the server still had state for that connection id and therefore rejected the new connection; I thought that might be what was meant by “active connection refused”.

I am not aware of any situation such as you describe, but I am open to reviewing a test that shows something similar.

I’m happy to produce one, but I’m not quite sure how to recreate the situation that would cause this state.

Enable logs (on the connection), and when it happens next that should be enough to go off of. Sorry for not being able to say more, but I need a bit more information, especially as I don’t know how to reproduce this locally.
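A minimal sketch of what enabling those logs looks like on the connection settings (the custom-logger adapter named in the comment is hypothetical):

```csharp
// Sketch of enabling verbose client-side connection logging via the
// ClientAPI settings builder. UseConsoleLogger writes to the console;
// UseCustomLogger can route to your own logger instead.
using EventStore.ClientAPI;

var settings = ConnectionSettings.Create()
    .EnableVerboseLogging()
    .UseConsoleLogger() // or: .UseCustomLogger(new MySerilogAdapter()) (hypothetical adapter)
    .Build();
```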

OK, we’ll keep an eye out. I’ve disabled the workaround for one service. Hopefully, if it’s going to happen, it happens soon and I can get back to this thread.