We’ve got a roughly 50 GB Event Store cluster that has spent today doing something approaching dying, and I’ve been trying to figure out why for several hours.
Symptoms:
Frequent 503 and 500 errors on attempted writes (with the occasional 408); we’re talking thousands of failures over the last few hours
The eventstored instances frequently just restart, without any output to the logs (see the kernel-log check after this list)
Near-constant leader elections, often one right after another
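Since the processes die without writing anything to their own logs, one thing I’m checking is whether the kernel is killing them. A minimal check, assuming Amazon Linux and that the process is named eventstored as in the log prefixes:

    # look for OOM-killer activity in the kernel ring buffer
    dmesg -T | grep -iE 'out of memory|killed process'
    # on systemd boxes the kernel log is also available via
    journalctl -k | grep -i oom

If the OOM killer were involved, that would at least fit the memory-related errors in the logs below.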
I do see some log messages:
[PID:08573:025 2016.11.06 00:15:40.704 DEBUG MonitoringService ] Could not get free mem on linux, received memory info raw string: []
System.IO.IOException: Error creating standard output pipe
or
[PID:02147:030 2016.11.06 00:39:59.053 TRACE GossipServiceBase ] --------------------------------------------------------------------------------
[PID:02147:026 2016.11.06 00:39:59.063 FATAL GLOBAL-LOGGER ] Global Unhandled Exception occurred.
System.ObjectDisposedException: The object was used after being disposed.
at System.Net.Sockets.Socket.AcceptAsync (System.Net.Sockets.SocketAsyncEventArgs e) [0x00000] in <filename unknown>:0
at System.Net.EndPointListener.Accept (System.Net.Sockets.Socket socket, System.Net.Sockets.SocketAsyncEventArgs e) [0x00000] in <filename unknown>:0
at System.Net.EndPointListener.ProcessAccept (System.Net.Sockets.SocketAsyncEventArgs args) [0x00000] in <filename unknown>:0
at System.Net.EndPointListener.OnAccept (System.Object sender, System.Net.Sockets.SocketAsyncEventArgs e) [0x00000] in <filename unknown>:0
at System.Net.Sockets.SocketAsyncEventArgs.OnCompleted (System.Net.Sockets.SocketAsyncEventArgs e) [0x00000] in <filename unknown>:0
at System.Net.Sockets.SocketAsyncEventArgs.AcceptCallback (IAsyncResult ares) [0x00000] in <filename unknown>:0
at System.Net.Sockets.SocketAsyncEventArgs.DispatcherCB (IAsyncResult ares) [0x00000] in <filename unknown>:0
[the same ObjectDisposedException stack trace is logged three more times within a millisecond of the first, by threads 020, 018, and 016]
[PID:02147:030 2016.11.06 00:39:59.396 INFO ClusterVNodeControll] ========== [10.0.23.65:2112] Service 'Master Replication Service' has shut down.
[PID:02147:030 2016.11.06 00:39:59.396 INFO ClusterVNodeControll] ========== [10.0.23.65:2112] All Services Shutdown.
and
There was also a log message that looked more specifically like running out of file handles, but I’m not finding it at the moment.
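On that theory: every accepted socket and every pipe costs a file descriptor, so I’ve started watching the live descriptor count against the limit the process actually has. A rough sketch (assuming one eventstored process per box; pgrep -o takes the oldest match; run as root or as the eventstore user):

    PID=$(pgrep -o eventstored)
    # number of descriptors currently open
    ls /proc/$PID/fd | wc -l
    # the limit the running process actually has
    grep 'open files' /proc/$PID/limits

The “Error creating standard output pipe” above would fit descriptor exhaustion too, since it looks like the monitoring code spawns a child process to read free memory, and creating the child’s stdout pipe needs a descriptor.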
The cluster is running ES 3.6.3 on AWS on m4.xlarge instances (I moved from m4.large to m4.xlarge today to try to fix this, but to no avail). The database is on an EBS volume.
The system is not under “heavy” load: reads are sometimes fairly frequent (pulling out long streams can put it at a few hundred reads/minute for a few minutes), but otherwise it’s doing less than one write per second.
All three nodes are in the same AZ (us-east-1b), though not in a placement group or anything else that would guarantee they’re on the same rack.
I’ve looked through here and seen some threads about raising file-handle limits, so I gave the eventstore user much higher limits in /etc/security/limits.conf, but that doesn’t seem to help; what I set and how I’m trying to verify it are below. I’m pretty stumped.
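For reference, this is roughly what I added to limits.conf (eventstore is the user running the service; 65536 is just an arbitrary “much higher” number):

    eventstore soft nofile 65536
    eventstore hard nofile 65536

One thing I’m genuinely unsure about: limits.conf is applied by pam_limits at login, so a daemon started by init/upstart/systemd may never see these values at all. The check that matters is against the live process, not a login shell:

    grep 'open files' /proc/$(pgrep -o eventstored)/limits

If that still shows the defaults (often 1024 soft / 4096 hard), the limits.conf change isn’t reaching the service, and it would have to be set in the init script (ulimit -n) or the systemd unit (LimitNOFILE=) instead.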
This is crippling for us.