Miniweb or EventStore frozen ?

Hello guys, I have an instability on a single node test instance I want to get through before going in production.

After a while, the node admin interface becomes unavailable:

$ service eventstore status

``

active & running

my consumers/producers services didn’t notice this, so i’m guessing it’s only miniweb

I also have this errors in the logs :

[PID:08879:058 2017.04.27 15:24:16.951 INFO MiniWeb ] Error while replying from MiniWeb
System.IO.IOException: Unable to write data to the transport connection: The socket has been shut down. —> System.Net.Sockets.SocketException: The socket has been shut down
at System.Net.Sockets.Socket.EndSend (System.IAsyncResult result) [0x00033] in <5071a6e4a4564e19a2eda0f53e42f9bd>:0
at System.Net.Sockets.NetworkStream.EndWrite (System.IAsyncResult asyncResult) [0x0005f] in <5071a6e4a4564e19a2eda0f53e42f9bd>:0
— End of inner exception stack trace —
at System.Net.Sockets.NetworkStream.EndWrite (System.IAsyncResult asyncResult) [0x000af] in <5071a6e4a4564e19a2eda0f53e42f9bd>:0
at System.Net.ResponseStream.EndWrite (System.IAsyncResult ares) [0x00065] in <5071a6e4a4564e19a2eda0f53e42f9bd>:0
at EventStore.Transport.Http.AsyncQueuedBufferWriter.EndWrite (System.IAsyncResult ar) [0x00000] in :0
System.Net.Sockets.SocketException (0x80004005): The socket has been shut down
at System.Net.Sockets.Socket.EndSend (System.IAsyncResult result) [0x00033] in <5071a6e4a4564e19a2eda0f53e42f9bd>:0
at System.Net.Sockets.NetworkStream.EndWrite (System.IAsyncResult asyncResult) [0x0005f] in <5071a6e4a4564e19a2eda0f53e42f9bd>:0
[PID:08879:022 2017.04.27 15:35:50.435 ERROR TcpConnectionMonitor] # :20286 1124ms since last Receive started. No completion callback received, but socket status is READY_FOR_RECEIVE
[PID:08879:046 2017.04.27 15:37:52.834 ERROR TcpConnectionMonitor] # :20268 1433ms since last Receive started. No completion callback received, but socket status is READY_FOR_RECEIVE
[PID:08879:083 2017.04.27 15:38:23.558 ERROR TcpConnectionMonitor] # :20268 1126ms since last Receive started. No completion callback received, but socket status is READY_FOR_RECEIVE
[PID:08879:100 2017.04.27 15:47:02.703 ERROR TcpConnectionMonitor] # :20268 1510ms since last Receive started. No completion callback received, but socket status is READY_FOR_RECEIVE
[PID:08879:061 2017.04.27 16:03:50.634 ERROR TcpConnectionMonitor] # :20109 1629ms since last Receive started. No completion callback received, but socket status is READY_FOR_RECEIVE
[PID:08879:080 2017.04.27 16:13:00.888 ERROR TcpConnectionMonitor] # :20168 962ms since last Receive started. No completion callback received, but socket status is READY_FOR_RECEIVE
[PID:08879:038 2017.04.27 16:17:05.212 ERROR TcpConnectionMonitor] # :20103 1033ms since last Receive started. No completion callback received, but socket status is READY_FOR_RECEIVE
[PID:08879:075 2017.04.27 16:19:08.475 ERROR TcpConnectionMonitor] # :20268 1551ms since last Receive started. No completion callback received, but socket status is READY_FOR_RECEIVE

``

When then have to restart the service to get it back.

Tried to reproduce this pbm this morning by flooding the server, but I just got the projections broken (commit retries timeout) and nothing more.

Here is the configuration :

[PID:31205:001 2017.05.04 16:29:53.953 INFO ProgramBase1 ] ES VERSION: 4.0.1.0 (HEAD/5f53330a4cc31fc6eb8b337cc630038b40a4f47a, Wed, 12 Apr 2017 15:28:32 +0200) [PID:31205:001 2017.05.04 16:29:53.977 INFO ProgramBase1 ] OS: Linux (Unix 4.4.38.1)
[PID:31205:001 2017.05.04 16:29:53.981 INFO ProgramBase1 ] RUNTIME: 4.6.2 (Stable 4.6.2.16/ac9e222) (64-bit) [PID:31205:001 2017.05.04 16:29:53.981 INFO ProgramBase1 ] GC: 2 GENERATIONS
[PID:31205:001 2017.05.04 16:29:53.981 INFO ProgramBase1 ] LOGS: /var/log/eventstore [PID:31205:001 2017.05.04 16:29:53.991 INFO ProgramBase1 ] MODIFIED OPTIONS:

RUN PROJECTIONS: All (Config File)
CLUSTER SIZE: 1 (Config File)
EXT IP: 10.5.133.39 (Config File)
EXT TCP PORT: 1112 (Config File)
EXT HTTP PORT: 2114 (Config File)
EXT HTTP PREFIXES: http://*:2114/ (Config File)
ADD INTERFACE PREFIXES: false (Config File)
DB: /ges/ (Config File)

DEFAULT OPTIONS:

``

Hi,

Could you tell us a bit more about the environment you are running this node in?

Also, would you be able to provide us with full logs of the issue? If you don’t want to post them here, you can send them to hayley at geteventstore.com.

When you say that the projections are broken, do they go into a faulted state?

Hello Hayley

The node is an ubuntu 16.04 virtual machine, with 6 cores, 15 Go of RAM
I just sent you the logs.

Don’t bother with the projections (yes they were faulted and restarted well when asked),
I just mentioned it in a way to say the problem i had was probably not because of too much load (because when I add some heavy load, the projections breaks / not the web UI)

As you can see in the logs, there were none for a gap of 5 days … and this is not normal because the node was running in this days
so I don’t know wtf there is no log at all for 5 days, until I restart the service by

$ service eventstore stop
$ service eventstore start

``

Thank you for the logs.

How many times has this happened? I can only find one instance of it in the error logs, has it happened at other times?
When you say that the admin interface is unavailable, what do you see when you try to navigate to any of the pages?

Also, what kind of load is this node under?

Thanks

I am seeing *many* entries in your log of VVV. What settings are you
running for heartbeat timeouts and what does your network look like?
Are the clients on the same network?

[PID:08879:041 2017.04.27 15:05:21.506 INFO TcpConnection ] ES
TcpConnection closed [15:05:21.506: N212.47.228.233:20146,
L10.5.133.39:1112, {f8eaca37-30b4-4b69-871d-786066bc9456}]:Received
bytes: 74175998, Sent bytes: 31086
[PID:08879:041 2017.04.27 15:05:21.506 INFO TcpConnection ] ES
TcpConnection closed [15:05:21.506: N212.47.228.233:20146,
L10.5.133.39:1112, {f8eaca37-30b4-4b69-871d-786066bc9456}]:Send calls:
306, callbacks: 306
[PID:08879:041 2017.04.27 15:05:21.506 INFO TcpConnection ] ES
TcpConnection closed [15:05:21.506: N212.47.228.233:20146,
L10.5.133.39:1112, {f8eaca37-30b4-4b69-871d-786066bc9456}]:Receive
calls: 9582, callbacks: 9582
[PID:08879:041 2017.04.27 15:05:21.506 INFO TcpConnection ] ES
TcpConnection closed [15:05:21.506: N212.47.228.233:20146,
L10.5.133.39:1112, {f8eaca37-30b4-4b69-871d-786066bc9456}]:Close
reason: [Success] Socket closed
[PID:08879:041 2017.04.27 15:05:21.560 INFO TcpConnectionManager]
Connection 'external-normal:CAT-IMP-ProductPartitionActor'
[212.47.228.233:20146, {f8eaca37-30b4-4b69-871d-786066bc9456}] closed:
Success.
[PID:08879:070 2017.04.27 15:05:21.562 DEBUG PersistentSubscripti]
Lost connection from 212.47.228.233:20146
[PID:08879:044 2017.04.27 15:05:21.618 TRACE InMemoryBus ]
SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 350ms. Handler:
UnwrapEnvelopeHandler.
[PID:08879:044 2017.04.27 15:05:21.619 TRACE QueuedHandlerAutoRes]
SLOW QUEUE MSG [Projection Core #1]: UnwrapEnvelopeMessage - 351ms. Q:
0/252.
[PID:08879:048 2017.04.27 15:05:21.620 INFO TcpService ]
External TCP connection accepted: [Normal, 212.47.228.233:20040,
L10.5.133.39:1112, {6f45226d-ec04-46da-85bf-ee982e0dea71}].
[PID:08879:055 2017.04.27 15:05:21.632 INFO TcpConnectionManager]
Connection 'external-normal' ({6f45226d-ec04-46da-85bf-ee982e0dea71})
identified by client. Client connection name:
'CAT-IMP-ProductPartitionActor', Client version: V2.

@hayley : it happened 2 or 3 times on the same machine. I have the logs only for the last time because I cleaned everything to install v 4.0.1 hopping it was a know & corrected bug.

So when it happen my browser just wait a fair amount of time, then declaring the request timed out (So I guess there is no http status code since there is no answer, but i’ll have to reproduce to be sure).

We are stressing this machine a bit for test purposes, but i should be in the manageable limits (less than 100 connections, some pooling streams, heavy writes : 8Kb events, 100 000 of it the faster we can through 200 more streams, and projections).

@greg : clients windows vm hosted in KVM, sharing the host public internet ip (only 1 VM per host). But the are on the same provider network (while I don’t know what is done in our provider switches).
I don’t explain yet all those connection close in the log. I have multiple services on a 10 machines cluster orchestration, requesting the node, but all those closed events (from the same port) look suspicious to me too.

Thanks for the info, a couple more questions from me:

  1. What version of Event Store were you on before moving to 4.0.1?

  2. Does the admin UI stay permanently broken, or does it work again if you refresh the page?

  3. Are you seeing any sort of similar behaviour with http clients?

  1. it was the 4.0.0

  2. yes, permanently until I restart the eventstore service, a refresh doesn’t change anything.

  3. I noticed no connection problem on the other clients, I use the C# client, it’s a tcp connection used by default no ?

Yes, the c# client uses a tcp connection.

We’re going to have to investigate this further. The admin UI shouldn’t break like that because of a single error. This might be because of some issues with changing mono versions or something along those lines.

Thank you for your help thus far, if this happens again, please let us know

Just to give you an update on this, I have opened an issue on github to make this easier to track. https://github.com/EventStore/EventStore/issues/1297

Got it again in a clean way (no load on the node)
I sent you the log of the day by comment on issue 1297.

My bad, the miniweb answers again, so it’s perhaps not the same issue,
Suppressed the comment on issue and post it here, just in case you could relate it to the previous one.

Just happened again with no load on the node, it was idle all the morning :

[PID:29388:019 2017.05.10 02:38:54.248 TRACE InMemoryBus         ] SLOW BUS MSG [MainBus]: PurgeTimedOutRequests - 74ms. Handler: HttpService.
[PID:29388:019 2017.05.10 02:38:54.248 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [MainQueue]: PurgeTimedOutRequests - 74ms. Q: 0/1.
[PID:29388:019 2017.05.10 03:03:00.782 TRACE InMemoryBus         ] SLOW BUS MSG [MainBus]: PurgeTimedOutRequests - 51ms. Handler: HttpService.
[PID:29388:019 2017.05.10 03:03:00.790 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [MainQueue]: PurgeTimedOutRequests - 59ms. Q: 0/2.
[PID:29388:019 2017.05.10 03:19:35.322 TRACE InMemoryBus         ] SLOW BUS MSG [MainBus]: ReadAllEventsForward - 56ms. Handler: WideningHandler`2.
[PID:29388:019 2017.05.10 03:19:35.324 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [MainQueue]: ReadAllEventsForward - 59ms. Q: 1/4.
[PID:29388:023 2017.05.10 03:19:35.325 TRACE InMemoryBus         ] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 60ms. Handler: UnwrapEnvelopeHandler.
[PID:29388:023 2017.05.10 03:19:35.325 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projection Core #0]: UnwrapEnvelopeMessage - 60ms. Q: 0/4.
[PID:29388:019 2017.05.10 03:34:09.322 TRACE InMemoryBus         ] SLOW BUS MSG [MainBus]: ReadAllEventsForward - 81ms. Handler: WideningHandler`2.
[PID:29388:019 2017.05.10 03:34:09.329 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [MainQueue]: ReadAllEventsForward - 88ms. Q: 2/2.
[PID:29388:025 2017.05.10 03:34:09.322 TRACE InMemoryBus         ] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 84ms. Handler: UnwrapEnvelopeHandler.
[PID:29388:025 2017.05.10 03:34:09.330 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projection Core #2]: UnwrapEnvelopeMessage - 92ms. Q: 0/4.
[PID:29388:019 2017.05.10 05:08:03.447 TRACE InMemoryBus         ] SLOW BUS MSG [MainBus]: PurgeTimedOutRequests - 66ms. Handler: HttpService.
[PID:29388:019 2017.05.10 05:08:03.481 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [MainQueue]: PurgeTimedOutRequests - 135ms. Q: 0/3.
[PID:29388:023 2017.05.10 05:52:15.959 TRACE InMemoryBus         ] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 50ms. Handler: UnwrapEnvelopeHandler.
[PID:29388:023 2017.05.10 05:52:15.959 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [Projection Core #0]: UnwrapEnvelopeMessage - 51ms. Q: 0/4.
[PID:29388:062 2017.05.10 06:07:11.329 INFO  TcpConnection       ] ES TcpConnection closed [06:07:11.329: N<IP4>:20210, L<IP1>:1112, {3980b2b2-33b2-4022-8bf3-580dadd88294}]:Received bytes: 1775665, Sent bytes: 2207109
[PID:29388:062 2017.05.10 06:07:11.414 INFO  TcpConnection       ] ES TcpConnection closed [06:07:11.414: N<IP4>:20210, L<IP1>:1112, {3980b2b2-33b2-4022-8bf3-580dadd88294}]:Send calls: 80598, callbacks: 80598
[PID:29388:062 2017.05.10 06:07:11.414 INFO  TcpConnection       ] ES TcpConnection closed [06:07:11.414: N<IP4>:20210, L<IP1>:1112, {3980b2b2-33b2-4022-8bf3-580dadd88294}]:Receive calls: 80599, callbacks: 80599
[PID:29388:062 2017.05.10 06:07:11.414 INFO  TcpConnection       ] ES TcpConnection closed [06:07:11.414: N<IP4>:20210, L<IP1>:1112, {3980b2b2-33b2-4022-8bf3-580dadd88294}]:Close reason: [Success] Socket closed
[PID:29388:062 2017.05.10 06:07:11.414 INFO  TcpConnectionManager] Connection 'external-normal:<SERVICE_NAME>' [<IP4>:20210, {3980b2b2-33b2-4022-8bf3-580dadd88294}] closed: Success.
[PID:29388:040 2017.05.10 06:07:11.416 DEBUG PersistentSubscripti] Lost connection from <IP4>:20210
[PID:29388:061 2017.05.10 06:07:11.533 INFO  TcpService          ] External TCP connection accepted: [Normal, <IP4>:20216, L<IP1>:1112, {a8c54af9-5357-4cac-bb6a-b247618c52d5}].
[PID:29388:035 2017.05.10 06:07:11.535 INFO  TcpConnectionManager] Connection 'external-normal' ({a8c54af9-5357-4cac-bb6a-b247618c52d5}) identified by client. Client connection name: '<SERVICE_NAME>', Client version: V2.
[PID:29388:019 2017.05.10 07:34:13.846 TRACE InMemoryBus         ] SLOW BUS MSG [MainBus]: PersistentSubscriptionTimerTick - 72ms. Handler: WideningHandler`2.
[PID:29388:019 2017.05.10 07:34:13.846 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [MainQueue]: PersistentSubscriptionTimerTick - 92ms. Q: 1/2.
[PID:29388:019 2017.05.10 07:48:47.685 TRACE InMemoryBus         ] SLOW BUS MSG [MainBus]: PurgeTimedOutRequests - 61ms. Handler: HttpService.
[PID:29388:019 2017.05.10 07:48:47.685 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [MainQueue]: PurgeTimedOutRequests - 62ms. Q: 0/1.
[PID:29388:019 2017.05.10 08:19:25.918 TRACE QueuedHandlerAutoRes] SLOW QUEUE MSG [MainQueue]: SendViewChangeProof - 49ms. Q: 0/1.

and then this at noon :

[PID:29388:061 2017.05.10 12:44:07.250 INFO  MiniWeb             ] Error while replying from MiniWeb
System.IO.IOException: Unable to write data to the transport connection: The socket has been shut down. ---> System.Net.Sockets.SocketException: The socket has been shut down
  at System.Net.Sockets.Socket.EndSend (System.IAsyncResult result) [0x00033] in <5071a6e4a4564e19a2eda0f53e42f9bd>:0
  at System.Net.Sockets.NetworkStream.EndWrite (System.IAsyncResult asyncResult) [0x0005f] in <5071a6e4a4564e19a2eda0f53e42f9bd>:0
   --- End of inner exception stack trace ---
  at System.Net.Sockets.NetworkStream.EndWrite (System.IAsyncResult asyncResult) [0x000af] in <5071a6e4a4564e19a2eda0f53e42f9bd>:0
  at System.Net.ResponseStream.EndWrite (System.IAsyncResult ares) [0x00065] in <5071a6e4a4564e19a2eda0f53e42f9bd>:0
  at EventStore.Transport.Http.AsyncQueuedBufferWriter.EndWrite (System.IAsyncResult ar) [0x00000] in <a4bce72d8f6e43208599fe66ae51a7b1>:0
System.Net.Sockets.SocketException (0x80004005): The socket has been shut down
  at System.Net.Sockets.Socket.EndSend (System.IAsyncResult result) [0x00033] in <5071a6e4a4564e19a2eda0f53e42f9bd>:0
  at System.Net.Sockets.NetworkStream.EndWrite (System.IAsyncResult asyncResult) [0x0005f] in <5071a6e