ES cluster is intermittently (mostly) dying

We’ve got a roughly 50 GB Event Store cluster that decided to do something approaching dying today, and I’ve been trying to figure out why for several hours.

Symptoms:

Frequent 503 and 500 errors on attempted writes, with an occasional 408 - we’re talking thousands of failures over the last few hours

The eventstored instances frequently restart without writing anything to the logs

Many, many leader elections - at times they are essentially constant
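One way to watch the elections as they happen (not something I had in place, just a standard check; 2113 is Event Store’s default external HTTP port and the IPs are two of our nodes) is to poll each node’s gossip endpoint and see who it currently thinks the master is:

```shell
#!/bin/sh
# Poll each node's gossip endpoint to see its view of cluster membership
# and the current master. 2113 is the default external HTTP port;
# substitute your own node addresses.
gossip_url() { printf 'http://%s:2113/gossip?format=json' "$1"; }

for node in 10.0.23.65 10.0.23.209; do
  echo "== $node =="
  curl -s --max-time 2 "$(gossip_url "$node")" || echo "(unreachable)"
done
```

If the nodes disagree about who the master is, or the answer changes on every poll, that lines up with constant elections.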

I do see some log messages:


[PID:08573:025 2016.11.06 00:15:40.704 DEBUG MonitoringService   ] Could not get free mem on linux, received memory info raw string: []

System.IO.IOException: Error creating standard output pipe

or


[PID:02147:030 2016.11.06 00:39:59.053 TRACE GossipServiceBase   ] --------------------------------------------------------------------------------

[PID:02147:026 2016.11.06 00:39:59.063 FATAL GLOBAL-LOGGER       ] Global Unhandled Exception occurred.

System.ObjectDisposedException: The object was used after being disposed.

at System.Net.Sockets.Socket.AcceptAsync (System.Net.Sockets.SocketAsyncEventArgs e) [0x00000] in <filename unknown>:0

at System.Net.EndPointListener.Accept (System.Net.Sockets.Socket socket, System.Net.Sockets.SocketAsyncEventArgs e) [0x00000] in <filename unknown>:0

at System.Net.EndPointListener.ProcessAccept (System.Net.Sockets.SocketAsyncEventArgs args) [0x00000] in <filename unknown>:0

at System.Net.EndPointListener.OnAccept (System.Object sender, System.Net.Sockets.SocketAsyncEventArgs e) [0x00000] in <filename unknown>:0

at System.Net.Sockets.SocketAsyncEventArgs.OnCompleted (System.Net.Sockets.SocketAsyncEventArgs e) [0x00000] in <filename unknown>:0

at System.Net.Sockets.SocketAsyncEventArgs.AcceptCallback (IAsyncResult ares) [0x00000] in <filename unknown>:0

at System.Net.Sockets.SocketAsyncEventArgs.DispatcherCB (IAsyncResult ares) [0x00000] in <filename unknown>:0

[the same ObjectDisposedException and stack trace are logged three more times in the same millisecond, from threads 020, 018, and 016]

[PID:02147:030 2016.11.06 00:39:59.396 INFO  ClusterVNodeControll] ========== [10.0.23.65:2112] Service 'Master Replication Service' has shut down.

[PID:02147:030 2016.11.06 00:39:59.396 INFO  ClusterVNodeControll] ========== [10.0.23.65:2112] All Services Shutdown.

There was also a log message that mentioned running out of file handles more specifically, but I can’t find it at the moment.

The cluster is running ES 3.6.3 on AWS on m4.xlarge instances (I moved from m4.large to m4.xlarge today to try to fix this, but to no avail). The database is on an EBS volume.

The system is not under “heavy” load - reads are sometimes frequent (pulling out long streams can push it to a few hundred reads/minute for a few minutes), but otherwise we’re doing less than one write per second.

All three nodes are in the same AZ (us-east-1b), though not in a placement group that would guarantee they share a rack or anything like that.

I have looked through here and seen some suggestions about raising file-handle limits, so I gave the eventstore user much higher limits in /etc/security/limits.conf, but that doesn’t seem to help. I am pretty stumped.
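For reference, the limits.conf entries I mean look roughly like this (the eventstore user name and the values are from my setup; tune as needed):

```
# /etc/security/limits.conf - raise open-file limits for the eventstore user.
# Note: pam_limits applies these at login time, so they often don't affect
# daemons started by init/upstart - which may be why this didn't help.
eventstore  soft  nofile  65000
eventstore  hard  nofile  65000
```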

This is crippling for us.

So I decided to upgrade to 3.9.2 since we’re down anyway, and the upgrade itself seemed to go fine (the cluster is still broken-ish, but things loaded). This line jumped out at me:

Closing connection 'master-normal' [10.0.23.65:1112, L10.0.23.209:46183, {9ef84d7c-dcc6-4d73-8591-5cf15055c561}] cleanly. Reason: Reconnecting from old master [10.0.23.65:1112] to new master: [10.0.23.65:1112].

ES version?

A log?

3.6.3 - though I just upgraded to 3.9.2.

Here is a link to the log from one node

https://www.dropbox.com/s/o7tkk4pyphpb9v8/10.0.23.209-2113-cluster-node.log.gz?dl=0

I did just raise the file limits by editing /etc/init/eventstore.conf and adding:

limit nofile 4096 65000

And restarted. Now the limits on the process look sane (cat /proc/&lt;pid&gt;/limits) - the previous change to /etc/security/limits.conf seems to have never taken effect.
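To confirm the new limit actually took, and to watch whether we’re creeping toward it, I checked the running process like this (the process name eventstored is from my setup; the script defaults to the current shell’s PID so it can be run anywhere):

```shell
#!/bin/sh
# Show the effective open-file limit and current fd count for a process.
# Usage: ./check_fds.sh <pid>   (defaults to this shell's own PID)
pid=${1:-$$}
grep "Max open files" "/proc/$pid/limits"
echo "fds in use: $(ls "/proc/$pid/fd" | wc -l)"
```

Run it with the Event Store PID, e.g. `./check_fds.sh "$(pgrep -o eventstored)"`. If the fd count climbs toward the soft limit, that points back at the file-handle theory.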

We’ll see if this stabilizes it.