Mass eviction of TCP clients: why would EventStore close TCP connections?

Hi,

I have an EventStore cluster that runs fine most of the time. Occasionally, however, we see hiccups where a large number of TCP connections appear to be closed. The server simply states that they are closed, as do the clients; it looks as though the connections are simply cut off on both ends without a reason.

Is there anything that could cause EventStore to mass evict connected TCP clients?

In the logs for the EventStore node, I see this (followed by a lot of lines describing other connections being closed). Everything then recovers, with TCP reconnects following.

I [00001,80,14:47:38.387] SLOW QUEUE MSG [Worker #5]: TcpSend - 761ms. Q: 0/6.

I [00001,51,14:47:38.918] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1336ms. Q: 0/0.

I [00001,58,14:47:39.295] ES TcpConnection closed [14:47:39.295: N10.4.5.187:45604, L10.4.4.13:1113, {6be7fdbc-adb7-44e8-b01d-4c32893af28d}]:Send calls: 5331, callbacks: 5331

I [00001,58,14:47:39.295] ES TcpConnection closed [14:47:39.296: N10.4.5.187:45604, L10.4.4.13:1113, {6be7fdbc-adb7-44e8-b01d-4c32893af28d}]:Receive calls: 5330, callbacks: 5330

I [00001,97,14:47:39.295] Lost connection from 10.4.2.247:57822

I [00001,58,14:47:39.295] ES TcpConnection closed [14:47:39.296: N10.4.5.187:45604, L10.4.4.13:1113, {6be7fdbc-adb7-44e8-b01d-4c32893af28d}]:Close reason: [Success] Socket closed

I [00001,58,14:47:39.296] Connection 'external-normal' [10.4.5.187:45604, {6be7fdbc-adb7-44e8-b01d-4c32893af28d}] closed: Success.

I [00001,77,14:47:39.296] Lost connection from 10.4.5.187:45604

I [00001,50,14:47:39.306] ES TcpConnection closed [14:47:39.306: N10.4.14.58:45086, L10.4.4.13:1113, {e8f946f2-3029-4d2b-a288-c26afeef86e6}]:Received bytes: 1107177, Sent bytes: 21937708

I [00001,50,14:47:39.306] ES TcpConnection closed [14:47:39.307: N10.4.14.58:45086, L10.4.4.13:1113, {e8f946f2-3029-4d2b-a288-c26afeef86e6}]:Send calls: 47019, callbacks: 47019

I [00001,50,14:47:39.306] ES TcpConnection closed [14:47:39.307: N10.4.14.58:45086, L10.4.4.13:1113, {e8f946f2-3029-4d2b-a288-c26afeef86e6}]:Receive calls: 46744, callbacks: 46744

I [00001,50,14:47:39.306] ES TcpConnection closed [14:47:39.307: N10.4.14.58:45086, L10.4.4.13:1113, {e8f946f2-3029-4d2b-a288-c26afeef86e6}]:Close reason: [Success] Socket closed

Is it possible that disk IO is starving network traffic, or something along those lines? Could a slow TCP connection be starving the others and causing the drops?
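If it helps, the kind of check I have in mind is just sampling per-disk IO time around the incidents and flagging spikes. A rough sketch (assumes Linux and the third-party psutil package; the threshold is arbitrary, so tune it for your hardware):

# Sample per-disk IO timing once a second and flag disks that spent most of
# the last second doing IO. Assumes Linux and psutil (pip install psutil).
import time
import psutil

INTERVAL = 1.0           # seconds between samples
BUSY_THRESHOLD_MS = 800  # arbitrary: flag if a disk spent >800ms of the last second on IO

prev = psutil.disk_io_counters(perdisk=True)
while True:
    time.sleep(INTERVAL)
    cur = psutil.disk_io_counters(perdisk=True)
    for dev, stats in cur.items():
        if dev not in prev:
            continue
        # read_time/write_time are cumulative milliseconds spent on IO
        busy_ms = (stats.read_time - prev[dev].read_time) + (stats.write_time - prev[dev].write_time)
        if busy_ms > BUSY_THRESHOLD_MS:
            print(f"{time.strftime('%H:%M:%S')} {dev}: ~{busy_ms}ms of IO in the last second")
    prev = cur

Running something like that next to the node during one of the blips would at least tell us whether the data disk is saturated when the SLOW QUEUE MSG lines show up.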

Cheers,

Kristian

Kristian,
I’m experiencing the same issue. Did you ever find the cause/solution?

Mike

"I [00001,58,14:47:39.295] ES TcpConnection closed [14:47:39.296: N10.4.5.187:45604, L10.4.4.13:1113, {6be7fdbc-adb7-44e8-b01d-4c32893af28d}]:Close reason: [Success] Socket closed
I [00001,58,14:47:39.296] Connection 'external-normal' [10.4.5.187:45604, {6be7fdbc-adb7-44e8-b01d-4c32893af28d}] closed: Success."

What do the client logs say? Could the clients be closing the connections for some reason, either implicitly or explicitly?

I’ve found VERY SLOW QUEUE MSG errors in the logs while the connections are dropping.

While digging into that, I've found a bunch of places where you've said not to run EventStore in Docker in production, which is what we're doing. So I guess that's the problem.

{ "PID": "1", "ThreadID": "3156", "Date": "2019-12-05T17:33:51.402236Z", "Level": "Info", "Logger": "TcpConnection", "Message": "ES {connectionType} closed [{dateTime:HH:mm:ss.fff}: N{remoteEndPoint}, L{localEndPoint}, {connectionId:B}]:Close reason: [{socketError}] {reason}", "EventProperties": { "connectionType": "TcpConnection", "dateTime": "2019-12-05T17:33:51.403005Z", "remoteEndPoint": "192.168.128.2:36886", "localEndPoint": "192.168.128.12:1113", "connectionId": "46157320-4a9a-4dae-b088-08ad67dbde75", "socketError": "Success", "reason": "Socket closed" } }

A selection of error messages:

{ "PID": "1", "ThreadID": "3270", "Date": "2019-12-05T09:53:20.539135Z", "Level": "Error", "Logger": "TcpConnectionMonitor", "Message": "# {connection} {sinceLastReceive}ms since last Receive started. No completion callback received, but socket status is READY_FOR_RECEIVE", "EventProperties": { "connection": "192.168.128.13:43724", "sinceLastReceive": 745 } }

{ "PID": "1", "ThreadID": " 20", "Date": "2019-12-05T17:34:06.394022Z", "Level": "Error", "Logger": "QueuedHandlerMRES", "Message": "---!!! VERY SLOW QUEUE MSG [{queue}]: {message} - {elapsed}ms. Q: {prevQueueCount}/{curQueueCount}.", "EventProperties": { "queue": "MainQueue", "message": "ConnectionClosed", "elapsed": 9726, "prevQueueCount": 1, "curQueueCount": 33 } }

{ "PID": "1", "ThreadID": "3363", "Date": "2019-12-05T17:34:08.214480Z", "Level": "Error", "Logger": "QueuedHandlerThreadPool", "Message": "---!!! VERY SLOW QUEUE MSG [{queue}]: {message} - {elapsed}ms. Q: {prevQueueCount}/{curQueueCount}.", "EventProperties": { "queue": "MonitoringQueue", "message": "GetFreshStats", "elapsed": 16811, "prevQueueCount": 0, "curQueueCount": 1 } }

{ "PID": "1", "ThreadID": "3156", "Date": "2019-12-05T17:38:57.771014Z", "Level": "Error", "Logger": "InMemoryBus", "Message": "---!!! VERY SLOW BUS MSG [{bus}]: {message} - {elapsed}ms. Handler: {handler}.", "EventProperties": { "bus": "Worker #2 Bus", "message": "TcpSend", "elapsed": 10331, "handler": "TcpSendService" } }

{ "PID": "1", "ThreadID": " 20", "Date": "2019-12-05T17:39:37.397660Z", "Level": "Error", "Logger": "QueuedHandlerMRES", "Message": "---!!! VERY SLOW QUEUE MSG [{queue}]: {message} - {elapsed}ms. Q: {prevQueueCount}/{curQueueCount}.", "EventProperties": { "queue": "MainQueue", "message": "SubscribeToStream", "elapsed": 12392, "prevQueueCount": 52, "curQueueCount": 90 } }

{ "PID": "1", "ThreadID": "3034", "Date": "2019-12-05T17:41:57.771430Z", "Level": "Error", "Logger": "TcpConnectionMonitor", "Message": "# {connection} {sinceLastSend}ms since last send started. No completion callback received, but socket status is READY_FOR_SEND. In send: {inSendBytes}", "EventProperties": { "connection": "192.168.128.2:42772", "sinceLastSend": 2062, "inSendBytes": 22 } }
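For what it's worth, a quick way to line these up with the connection closes is to filter the node's JSON log for both kinds of entry and print them in timestamp order. A rough sketch (assumes the one-JSON-object-per-line format shown above; the log path is a placeholder):

# Pull the SLOW/VERY SLOW entries and the connection-close entries out of a
# node's JSON log and print them sorted by timestamp, so they can be lined up.
import json
import sys

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "node.log"  # placeholder path

MARKERS = ("VERY SLOW", "SLOW QUEUE MSG", "closed", "Lost connection")

entries = []
with open(LOG_PATH, encoding="utf-8") as log:
    for line in log:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that isn't a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        message = entry.get("Message", "")
        if any(marker in message for marker in MARKERS):
            entries.append((entry.get("Date", ""), entry.get("Logger", ""),
                            message, entry.get("EventProperties", {})))

for date, logger, message, props in sorted(entries):
    print(date, logger, message, props)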

The thing is that these outages appear out of nowhere. They seem to happen at odd hours of the day (like 6:45am today) when nobody is using the system, so it doesn't seem like resources should be under pressure.

What else can I look at?
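One thing I'm thinking of trying in the meantime is snapshotting a node's /stats endpoint around these windows, to see whether any queue lengths climb before the drops. A rough sketch (assumes the default external HTTP port 2113 and the default admin:changeit credentials; the stats key names differ between versions, so this just looks for anything queue-length-ish):

# Poll a node's /stats endpoint and print any non-zero queue lengths with a
# timestamp, so spikes can be correlated with the connection drops.
import base64
import json
import time
import urllib.request

NODE = "http://10.4.4.13:2113"  # placeholder: one of your nodes
AUTH = base64.b64encode(b"admin:changeit").decode()  # assumes default credentials

def fetch_stats():
    req = urllib.request.Request(NODE + "/stats",
                                 headers={"Authorization": "Basic " + AUTH})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

def queue_lengths(node, path=""):
    # Yield (path, value) for numeric leaves whose path looks queue-length-ish.
    if isinstance(node, dict):
        for key, value in node.items():
            yield from queue_lengths(value, path + "/" + str(key))
    elif isinstance(node, (int, float)):
        lowered = path.lower()
        if "queue" in lowered and "length" in lowered:
            yield path, node

while True:
    try:
        for path, value in queue_lengths(fetch_stats()):
            if value > 0:
                print(time.strftime("%H:%M:%S"), path, value)
    except Exception as exc:
        print(time.strftime("%H:%M:%S"), "stats request failed:", exc)
    time.sleep(10)

Even just leaving that running and comparing its output against the TcpConnection closed timestamps should show whether the MainQueue is backing up before the clients get dropped.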

I don't know your database configuration, but I had a similar issue and we figured out it was our DNS at fault (it's always DNS!).
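If you want to rule DNS in or out, one cheap test is to time lookups of the cluster's DNS/discovery name from inside the client containers and see whether stalls or failures line up with the drops. A rough sketch (eventstore.internal is just a placeholder for whatever name your clients actually resolve; the threshold is arbitrary):

# Time DNS lookups of the cluster name and print anything slow or failing.
import socket
import time

NAME = "eventstore.internal"  # placeholder: your cluster / discovery DNS name
SLOW_MS = 200                 # arbitrary threshold for "slow" lookups

while True:
    start = time.monotonic()
    try:
        socket.getaddrinfo(NAME, 2113)
        status = "ok"
    except socket.gaierror as exc:
        status = "failed: " + str(exc)
    elapsed_ms = (time.monotonic() - start) * 1000
    if status != "ok" or elapsed_ms > SLOW_MS:
        print(f"{time.strftime('%H:%M:%S')} lookup {status} in {elapsed_ms:.0f}ms")
    time.sleep(5)

If lookups from inside the containers stall at the same times as the drops, that points at name resolution rather than the nodes themselves.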