Event Store has been failing a lot more lately, and now the server has gone down completely. Here's the latest message

So over the last few weeks, EventStore has just flat out stopped working. Not much in the messages on the client side other than connections are still dropping (as per usual). When we tried to log in, we couldn’t. The eventstore login page would show but the button wouldn’t sign us in. It wouldn’t do anything. No error, no progress, nothing. So I tried to log into one of the other nodes and I was shown this message:

System.IO.IOException: Too many open files
at System.IO.FileStream..ctor (System.String path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share, System.Int32 bufferSize, System.Boolean anonymous, System.IO.FileOptions options) [0x0025f] in <8f2c484307284b51944a1a13a14c0266>:0
at System.IO.FileStream..ctor (System.String path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share) [0x00000] in <8f2c484307284b51944a1a13a14c0266>:0
at (wrapper remoting-invoke-with-check) System.IO.FileStream:.ctor (string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare)
at System.IO.File.OpenRead (System.String path) [0x00000] in <8f2c484307284b51944a1a13a14c0266>:0
at System.IO.File.ReadAllBytes (System.String path) [0x00000] in <8f2c484307284b51944a1a13a14c0266>:0
at EventStore.Core.Util.MiniWeb.ReplyWithContent (EventStore.Transport.Http.EntityManagement.HttpEntityManager http, System.String contentLocalPath) [0x001a6] in <82591026fe824176b191ced935e5b4b0>:0

This doesn’t sound like something we would have done, and we’ve done minimal configuration.

I was able to SSH into one of the nodes and pull the error log. Here’s one of the errors.

[PID:91259:006 2018.05.15 13:00:37.953 ERROR Application ] Exiting with exit code: 1.
Exit reason: Verification of chunk #143-143 (chunk-000143.000000) failed, terminating server…
[PID:91679:008 2018.05.15 13:09:25.681 ERROR Application ] Exiting with exit code: 1.
Exit reason: Verification of chunk #172-172 (chunk-000172.000000) failed, terminating server…
[PID:91751:016 2018.05.15 13:09:31.727 ERROR TableIndex ] ReadIndex is corrupted…
EventStore.Core.Exceptions.CorruptIndexException: Error while loading IndexMap. ---> System.IO.IOException: Too many open files
at System.IO.FileStream..ctor (System.String path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share, System.Int32 bufferSize, System.Boolean anonymous, System.IO.FileOptions options) [0x0025f] in <8f2c484307284b51944a1a13a14c0266>:0
at System.IO.FileStream..ctor (System.String path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share, System.Int32 bufferSize, System.IO.FileOptions options) [0x00000] in <8f2c484307284b51944a1a13a14c0266>:0
at (wrapper remoting-invoke-with-check) System.IO.FileStream:.ctor (string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare,int,System.IO.FileOptions)
at EventStore.Core.Index.PTable+WorkItem..ctor (System.String filename, System.Int32 bufferSize) [0x00006] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.Index.PTable+c__AnonStorey1.<>m__0 () [0x00000] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.DataStructures.ObjectPool`1[T]..ctor (System.String objectPoolName, System.Int32 initialCount, System.Int32 maxCount, System.Func`1[TResult] factory, System.Action`1[T] dispose, System.Action`1[T] onPoolDisposed) [0x000d1] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.Index.PTable..ctor (System.String filename, System.Guid id, System.Int32 initialReaders, System.Int32 maxReaders, System.Int32 depth, System.Boolean skipIndexVerify) [0x00122] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.Index.PTable.FromFile (System.String filename, System.Int32 cacheDepth, System.Boolean skipIndexVerify) [0x0000c] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.Index.IndexMap.LoadPTables (System.IO.StreamReader reader, System.String indexmapFilename, EventStore.Core.Data.TFPos checkpoints, System.Int32 cacheDepth, System.Boolean skipIndexVerify) [0x0007d] in <82591026fe824176b191ced935e5b4b0>:0
--- End of inner exception stack trace ---
at EventStore.Core.Index.IndexMap.LoadPTables (System.IO.StreamReader reader, System.String indexmapFilename, EventStore.Core.Data.TFPos checkpoints, System.Int32 cacheDepth, System.Boolean skipIndexVerify) [0x00110] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.Index.IndexMap.FromFile (System.String filename, System.Int32 maxTablesPerLevel, System.Boolean loadPTables, System.Int32 cacheDepth, System.Boolean skipIndexVerify) [0x00066] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.Index.TableIndex.Initialize (System.Int64 chaserCheckpoint) [0x000a2] in <82591026fe824176b191ced935e5b4b0>:0
System.IO.IOException: Too many open files
at System.IO.FileStream..ctor (System.String path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share, System.Int32 bufferSize, System.Boolean anonymous, System.IO.FileOptions options) [0x0025f] in <8f2c484307284b51944a1a13a14c0266>:0
at System.IO.FileStream..ctor (System.String path, System.IO.FileMode mode, System.IO.FileAccess access, System.IO.FileShare share, System.Int32 bufferSize, System.IO.FileOptions options) [0x00000] in <8f2c484307284b51944a1a13a14c0266>:0
at (wrapper remoting-invoke-with-check) System.IO.FileStream:.ctor (string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare,int,System.IO.FileOptions)
at EventStore.Core.Index.PTable+WorkItem..ctor (System.String filename, System.Int32 bufferSize) [0x00006] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.Index.PTable+c__AnonStorey1.<>m__0 () [0x00000] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.DataStructures.ObjectPool`1[T]..ctor (System.String objectPoolName, System.Int32 initialCount, System.Int32 maxCount, System.Func`1[TResult] factory, System.Action`1[T] dispose, System.Action`1[T] onPoolDisposed) [0x000d1] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.Index.PTable..ctor (System.String filename, System.Guid id, System.Int32 initialReaders, System.Int32 maxReaders, System.Int32 depth, System.Boolean skipIndexVerify) [0x00122] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.Index.PTable.FromFile (System.String filename, System.Int32 cacheDepth, System.Boolean skipIndexVerify) [0x0000c] in <82591026fe824176b191ced935e5b4b0>:0
at EventStore.Core.Index.IndexMap.LoadPTables (System.IO.StreamReader reader, System.String indexmapFilename, EventStore.Core.Data.TFPos checkpoints, System.Int32 cacheDepth, System.Boolean skipIndexVerify) [0x0007d] in <82591026fe824176b191ced935e5b4b0>:0
[PID:91751:016 2018.05.15 13:09:31.734 ERROR TableIndex ] IndexMap ‘/var/lib/eventstore/db/index/indexmap’ content:
000000: 35 34 43 30 34 36 45 30 42 41 37 36 39 37 32 43 | 54C046E0BA76972C
000016: 41 31 44 36 37 32 32 45 30 45 34 39 35 33 33 38 | A1D6722E0E495338
000032: 0A 31 0A 38 36 35 39 33 32 35 31 37 37 2F 38 36 | .1.8659325177/86
000048: 35 39 33 32 35 31 37 37 0A 30 2C 30 2C 65 39 64 | 59325177.0,0,e9d
000064: 36 36 30 61 38 2D 34 31 61 62 2D 34 36 32 38 2D | 660a8-41ab-4628-
000080: 39 33 61 38 2D 30 61 62 34 30 31 31 62 31 63 64 | 93a8-0ab4011b1cd
000096: 63 0A 31 2C 30 2C 36 32 63 62 34 39 63 32 2D 39 | c.1,0,62cb49c2-9
000112: 39 32 36 2D 34 39 63 38 2D 38 39 63 34 2D 34 64 | 926-49c8-89c4-4d
000128: 66 31 64 35 32 34 33 34 61 65 0A | f1d52434ae.

There are others like it but the hex dumps are different.

Too many open files…

Have you changed the default open-file limit of your Linux distribution?

No, we didn’t. Ubuntu 16, all default. What should it be?

I’m being told it is currently set to 706196

Er correction, it was set to 1024 but the system is capable of 706196
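For anyone wanting to check the same numbers on their own box, here is a minimal Python sketch. It assumes (the thread does not say) that the 1024 figure is the per-process soft limit and that the larger figure is the kernel-wide `fs.file-max` value; the function names are just illustrative.

```python
import resource
from pathlib import Path


def nofile_limits():
    """Return (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def system_file_max(path="/proc/sys/fs/file-max"):
    """Return the kernel-wide cap on open file handles, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Not Linux, or the file could not be read or parsed.
        return None


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"per-process soft limit: {soft}")
    print(f"per-process hard limit: {hard}")
    print(f"system-wide file-max:   {system_file_max()}")
```

Note that this reports the limits of the Python process itself, which are inherited from the shell that launched it. A service started by systemd can have different limits, so treat the output as a hint rather than the value the database actually runs with.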

So increase it.

Lol yeah… we did that in the meanwhile. It looks to be coming up now. So this wasn’t a problem before, is it going to pop up again? Is this something that needs to be monitored? What’s the logic behind how many files get opened? Like, how should we be thinking about this?
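On the monitoring question, one rough way to keep an eye on this is to compare how many file descriptors the server process has open against its soft limit. The sketch below is an assumption-laden example, not anything Event Store ships: it reads the Linux `/proc/<pid>/fd` directory and `/proc/<pid>/limits` file, the function names are made up, and the 0.8 warning threshold is arbitrary.

```python
import os
import sys
from pathlib import Path


def parse_max_open_files(limits_text):
    """Extract the soft 'Max open files' value from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Remaining columns are: soft limit, hard limit, units.
            soft = line[len("Max open files"):].split()[0]
            return None if soft == "unlimited" else int(soft)
    return None


def open_fd_count(pid):
    """Count entries in /proc/<pid>/fd (needs permission to read that directory)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_usage(pid):
    """Return (open_count, soft_limit, ratio); ratio is None if there is no limit."""
    limit = parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())
    count = open_fd_count(pid)
    ratio = count / limit if limit else None
    return count, limit, ratio


if __name__ == "__main__":
    # Pass the server's PID as the first argument; defaults to this script's own PID.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    count, limit, ratio = fd_usage(pid)
    print(f"pid {pid}: {count} open of {limit}")
    if ratio is not None and ratio > 0.8:  # arbitrary example threshold
        print("WARNING: close to the open-file limit")
```

Run periodically (from cron or a monitoring agent) against the server's PID, this would show whether the count creeps toward the limit as the database grows, which is the trend worth watching rather than any single reading.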

The database came back up but now all of the projections are gone, including the default projections D:

Eventually the projections came back… I’m just not yet familiar with the idiosyncrasies here. At any rate… I think it’s back to operating for the time being. We’re still seeing that problem I’ve brought up in the past with all these random disconnections and re-connection failures resulting in the client permanently disconnecting. Still no errors. Just disconnections. sigh…

You need to tune TCP for any web process on a *NIX box, and yeah, monitoring is usually a good idea.

http://www.lognormal.com/blog/2012/09/27/linux-tcpip-tuning/
https://www.cyberciti.biz/faq/linux-tcp-tuning/

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-network-dont-adjust-defaults

Some packages preset these limits in their systemd unit file (which is more the standard in systemd land)

You can override the default limit for the service by setting a systemd override file; e.g.

/etc/systemd/system/.d/override.conf

[Service]
LimitNOFILE=49152

But you should really load test and monitor, as the right values are pretty much defined by the size of your box/setup.

I should add that after creating the override you need to reload systemd with systemctl daemon-reload
before restarting the service

Then it’s systemctl cat servicename to check that the setting has been loaded
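As a complement to that check, `systemctl show` can report the value systemd has actually loaded for the unit, which is easier to script against than the unit file text. A small Python sketch follows; the unit name `eventstore.service` is a placeholder assumption, so substitute whatever your package installs, and the function names are illustrative only.

```python
import subprocess


def parse_limit_nofile(show_output):
    """Pull the LimitNOFILE value out of `systemctl show` KEY=VALUE output."""
    for line in show_output.splitlines():
        key, _, value = line.partition("=")
        if key == "LimitNOFILE":
            return value  # kept as a string because it may be "infinity"
    return None


def service_limit_nofile(unit):
    """Ask systemd for the LimitNOFILE it has loaded for `unit`."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=LimitNOFILE"],
        capture_output=True, text=True, check=True,
    )
    return parse_limit_nofile(result.stdout)


if __name__ == "__main__":
    # "eventstore.service" is a placeholder unit name, not taken from the thread.
    print(service_limit_nofile("eventstore.service"))
```

If this still prints the old value after editing the override, the likely cause is a missing daemon-reload. A process that was already running keeps the limit it started with until the service is restarted.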

Thanks Chris, we will look into that today. Do you happen to know if this stuff is included as part of the paid support package? I saw some stuff in there that looked like it included a monitoring instance in addition to setup scripts that might preconfigure this kind of stuff? We’re probably gonna pick that up pretty soon.