Eventstore issue, need help

Hello team:

We are seeing Event Store errors in our production environment.

The error log looks like this:

ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 269391ms. Q: 0/5540.

DEBUG HttpEntityManager ] Error copying forwarded response stream for ‘https://xx.xx.xx.xx:2114/projections/all-non-transient’: The specified network name is no longer available.

DEBUG HttpSendService ] Error occurred while replying to HTTP with message EventStore.Core.Messages.GossipMessage+SendGossip: The specified network name is no longer available.

INFO TcpConnectionManager] Connection ‘external-secure’ [xx.xx.xx.xx:52560, {4e439719-18f4-4b91-8d46-c4baef469754}] closed: SocketError.

INFO TcpConnectionSsl ] ES TcpConnectionSsl closed [01:08:39.100: Nxx.xx.xx.xx:52560, Lxx.xx.xx.xx:2115, {4e439719-18f4-4b91-8d46-c4baef469754}]:Close reason: [SocketError] Exception during EndRead.

DEBUG TcpConnectionSsl ] Exception during BeginWrite.
System.IO.IOException: Unable to write data to the transport connection: Safe handle has been closed. ---> System.ObjectDisposedException: Safe handle has been closed
at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success )
at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success )
at System.Net.UnsafeNclNativeMethods.OSSOCK.WSASend(SafeCloseSocket socketHandle, WSABuffer& buffer, Int32 bufferCount, Int32& bytesTransferred, SocketFlags socketFlags, SafeHandle overlapped, IntPtr completionRoutine )
at System.Net.Sockets.Socket.DoBeginSend(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, OverlappedAsyncResult asyncResult )
at System.Net.Sockets.Socket.BeginSend(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, SocketError& errorCode, AsyncCallback callback, Object state )
at System.Net.Sockets.NetworkStream.BeginWrite(Byte[] buffer, Int32 offset, Int32 size, AsyncCallback callback, Object state )
--- End of inner exception stack trace ---
at System.Net.Sockets.NetworkStream.BeginWrite(Byte[] buffer, Int32 offset, Int32 size, AsyncCallback callback, Object state )
at System.Net.Security._SslStream.StartWriting(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest )
at System.Net.Security._SslStream.ProcessWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest )
at System.Net.Security._SslStream.BeginWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncCallback asyncCallback, Object asyncState )
at EventStore.Transport.Tcp.TcpConnectionSsl.TrySend() in c:\projects\eventstore\src\EventStore.Transport.Tcp\TcpConnectionSsl.cs:line 332
System.ObjectDisposedException: Safe handle has been closed
at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success )
at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success )
at System.Net.UnsafeNclNativeMethods.OSSOCK.WSASend(SafeCloseSocket socketHandle, WSABuffer& buffer, Int32 bufferCount, Int32& bytesTransferred, SocketFlags socketFlags, SafeHandle overlapped, IntPtr completionRoutine )
at System.Net.Sockets.Socket.DoBeginSend(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, OverlappedAsyncResult asyncResult )
at System.Net.Sockets.Socket.BeginSend(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, SocketError& errorCode, AsyncCallback callback, Object state )
at System.Net.Sockets.NetworkStream.BeginWrite(Byte[] buffer, Int32 offset, Int32 size, AsyncCallback callback, Object state)

When these errors occurred, Event Store could not be accessed.

We run an Event Store cluster of 3 nodes behind a hardware load balancer. The servers run Windows Server 2008 R2, and all nodes are on Event Store v3.5.0.

I have been struggling with these Event Store problems for the past month.

Could you help me?

Thanks very much.

The first line in the log that you have given us is very worrying: 270 seconds on a write.
Are you seeing this as a recurring issue?
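
The 270-second figure comes from the 269391ms in the first log line quoted above. As a rough illustration, here is a minimal Python sketch for pulling that duration out of such lines. The regular expression is an assumption based only on the single line shown in this thread, and `parse_slow_queue_line` is a hypothetical helper, not part of Event Store.

```python
import re

# Pattern inferred from the one log line quoted in this thread; other
# Event Store versions may format this message differently.
SLOW_MSG = re.compile(
    r"VERY SLOW QUEUE MSG \[(?P<queue>[^\]]+)\]: (?P<msg>\w+) - (?P<ms>\d+)ms\."
)

def parse_slow_queue_line(line):
    """Return (queue, message, seconds) for a slow-queue log line, else None."""
    m = SLOW_MSG.search(line)
    if m is None:
        return None
    # Convert milliseconds to seconds for easier reading.
    return m.group("queue"), m.group("msg"), int(m.group("ms")) / 1000.0

line = "VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 269391ms. Q: 0/5540."
print(parse_slow_queue_line(line))
```

Running this over a full log file would show whether slow writes are isolated spikes or a steady pattern.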

You said you’ve suffered with Event Store problems in the past month? What kinds of problems are you seeing? What changed? Was this after an upgrade?

Is your storage network-based, e.g. a SAN? It looks like this machine
is having networking problems.

Thanks for your reply.

The problem has occurred 2 to 3 times in the past month.

We recently upgraded Event Store from version 3.0.5 to 3.5.0, but the problem also occurred on 3.0.5.

We have monitors that check Event Store's health using heartbeat requests. When Event Store is inaccessible (it cannot be written to or read from), the monitor sends me an email.

When I receive the email, I can see the error log mentioned above.

Once Event Store fails, our whole application stops working.
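
A heartbeat monitor of the kind described above could be sketched in Python roughly as follows. This is not our actual monitor: the function names, the consecutive-failure `threshold`, and the `notify` callback (standing in for the email step) are all assumptions for illustration, and the URL is whatever HTTP endpoint the node is expected to answer.

```python
import urllib.request
import urllib.error

def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if an HTTP GET to `url` answers with a 2xx status in time."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False

def check_and_alert(url, notify, failures, threshold=3, **kwargs):
    """Run one heartbeat check and return the new consecutive-failure count.

    `notify` is called exactly once, when the count reaches `threshold`
    (a hypothetical setting chosen to avoid alerting on a single blip).
    """
    if is_healthy(url, **kwargs):
        return 0
    failures += 1
    if failures == threshold:
        notify(f"Heartbeat to {url} failed {failures} times in a row")
    return failures
```

A scheduler would call `check_and_alert` at a fixed interval, carrying the returned count into the next call.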

On Tuesday, July 5, 2016 at 7:29:43 PM UTC+8, Pieter Germishuys wrote:

Are there any problems with running on a SAN?

On Tuesday, July 5, 2016 at 7:32:03 PM UTC+8, Greg Young wrote:

It's taking 270 seconds to process a write in this case, so yes, that's a
problem. Were there networking issues with the SAN around this time?

No. We have the same issues even though network usage is below 10% and disk throughput stays at 1 or 2 MB/s.

On Tuesday, July 5, 2016 at 7:49:52 PM UTC+8, Greg Young wrote: