Outage of EventStore cluster on production environment

Hi all,

Recently on production environment, we experienced three times production downtime caused by the outage of EventStore cluster.

Our applications can’t establish connection to the external secure tcp port after running for a while. We have 3 instances running on Windows Server 2008. Application uses secure tcp connection to write events into one EventStore instance and IEventStoreConnection is registered as InstancePerLifetimeScope.

This problem can be fix by restarting EventStore Server.

Configuration

  • EventStore Server: v3.0.1 OSS
  • External Secure Tcp Port: 2115
  • External Tcp port: 2112
  • External Http port: 2114

Machines

  • EventStore Cluster A - Machine A
  • EventStore Cluster B - Machine B
  • EventStore Cluster C - Machine C
  • Application Server - Machine D (Only talk to A to save events)
    During the outage, 2115 is LISTENING when we use command netstat -a -n. There are only around 10 established connections to the secure tcp port.
    When we try to telnet A 2115 from D, we got connection refused error. But we can even establish connections to 2112, 2114.

During the outage, the memory usage of the Machine A is almost full.

We are looking forward to get any advice or suggestions about this issue.
Thanks in advance.

Error logs on EventStore Server:

[PID:05180:003 2015.07.08 14:14:09.771 DEBUG TcpConnectionSsl    ] Exception during EndRead.
System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
   --- End of inner exception stack trace ---
   at System.Net.Security._SslStream.EndRead(IAsyncResult asyncResult)
   at EventStore.Transport.Tcp.TcpConnectionSsl.OnEndRead(IAsyncResult ar) in c:\EventStore\src\EventStore.Transport.Tcp\TcpConnectionSsl.cs:line 425
System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
[PID:05180:029 2015.07.08 14:30:19.199 INFO  TcpConnectionSsl    ] [S172.20.11.17:55668, L172.20.14.100:2115]: Exception on EndAuthenticateAsServer.
System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
   --- End of inner exception stack trace ---
   at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)
   at EventStore.Transport.Tcp.TcpConnectionSsl.OnEndAuthenticateAsServer(IAsyncResult ar) in c:\EventStore\src\EventStore.Transport.Tcp\TcpConnectionSsl.cs:line 154
System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
[PID:05180:022 2015.07.08 17:25:25.183 INFO  TcpConnectionSsl    ] [S172.20.11.17:56246, L172.20.14.100:2115]: Exception on EndAuthenticateAsServer.
System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
   --- End of inner exception stack trace ---
   at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)
   at EventStore.Transport.Tcp.TcpConnectionSsl.OnEndAuthenticateAsServer(IAsyncResult ar) in c:\EventStore\src\EventStore.Transport.Tcp\TcpConnectionSsl.cs:line 154
System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
[PID:05180:008 2015.07.08 19:20:03.755 INFO  TcpConnectionSsl    ] [S172.20.11.17:50848, L172.20.14.100:2115]: Exception on EndAuthenticateAsServer.
System.IO.IOException: Authentication failed because the remote party has closed the transport stream.
   at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)
   at EventStore.Transport.Tcp.TcpConnectionSsl.OnEndAuthenticateAsServer(IAsyncResult ar) in c:\EventStore\src\EventStore.Transport.Tcp\TcpConnectionSsl.cs:line 154
[PID:05180:008 2015.07.08 19:20:03.817 INFO  TcpConnectionSsl    ] [S172.20.11.17:50847, L172.20.14.100:2115]: Exception on EndAuthenticateAsServer.
System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
--- End of inner exception stack trace ---

Not strictly related to this (it sounds like it could be a legitimate bug, the SSL endpoints are not often used) but you’re using lifetimes wrong: you should have one connection per app, or a very small pool. If you need instance per lifetime semantics create a flyweight wrapper that talks to one of the long lived connections.

I have a few more questions:

  • how are you measuring the memory usage in order to determine it’s “full”?

  • how are your machines configured

  • is this reproducible or intermittent?

James

Were there messages coinciding with you connecting over telnet? Usually a connection refused is without the software ever having established a connection.

Hi James,

Thanks for your comment. I answered your questions in blue.

  • how are you measuring the memory usage in order to determine it’s “full”?

From the windows resource monitor, there are almost no free memory.

  • how are your machines configured

What configuration do you want to know? physical memory, number of cpu cores or something others?

  • is this reproducible or intermittent?

intermittent and after we restart server, it get back to normal.

在 2015年7月13日星期一 UTC+8上午2:41:58,James Nugent写道:

"- is this reproducible or intermittent?
     intermittent and after we restart server, it get back to normal."

How intermittent and what version of windows / CLR / framework are you running.

Cheers,

Greg

This issue happens around every 4 hours.

  • Windows version: Windows Server 2008 R2
  • CLR Version: 4
  • Framework: 4.5.1
  • EventStore version: 3.0.1 OSS
    It seems this issue is highly related to the inappropriate use of IEventStoreConnection.

After we registered it as single instance. For now there are no outage of EventStore reported.

This makes sense and gives a likely target for reproduction (tons of open closes)