A dead event store

I’ve praised the stability in the past, but I guess if you’re going to throw 25GB of events at a store and then use projections to generate another 50GB of emitted events, then at some point something is going to go wrong…

The question is, is it something I did? :slight_smile: (I’m on latest dev, btw.) I’m going to guess that a null ref exception isn’t something I should be able to create under ordinary circumstances :slight_smile:

System.NullReferenceException: Object reference not set to an instance of an object

at EventStore.Core.TransactionLog.Chunks.TFChunkWriter.Write (EventStore.Core.TransactionLog.LogRecords.LogRecord record, System.Int64& newPos) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.WritePrepareWithRetry (EventStore.Core.TransactionLog.LogRecords.PrepareLogRecord prepare) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.EventStore.Core.Bus.IHandle<EventStore.Core.Messages.StorageMessage.WritePrepares>.Handle (EventStore.Core.Messages.WritePrepares message) [0x00000] in :0

[PID:01668:007 2013.05.08 00:00:00.125 ERROR QueuedHandlerAutoRes] Error while processing message EventStore.Core.Messages.StorageMessage+WritePrepares in queued handler ‘StorageWriterQueue’.

System.NullReferenceException: Object reference not set to an instance of an object

at EventStore.Core.TransactionLog.Chunks.TFChunkWriter.Write (EventStore.Core.TransactionLog.LogRecords.LogRecord record, System.Int64& newPos) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.WritePrepareWithRetry (EventStore.Core.TransactionLog.LogRecords.PrepareLogRecord prepare) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.EventStore.Core.Bus.IHandle<EventStore.Core.Messages.StorageMessage.WritePrepares>.Handle (EventStore.Core.Messages.WritePrepares message) [0x00000] in :0

[PID:01668:007 2013.05.08 00:00:00.125 ERROR StorageWriterService] Exception in writer.

System.NullReferenceException: Object reference not set to an instance of an object

at EventStore.Core.TransactionLog.Chunks.TFChunkWriter.Write (EventStore.Core.TransactionLog.LogRecords.LogRecord record, System.Int64& newPos) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.WritePrepareWithRetry (EventStore.Core.TransactionLog.LogRecords.PrepareLogRecord prepare) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.EventStore.Core.Bus.IHandle<EventStore.Core.Messages.StorageMessage.WritePrepares>.Handle (EventStore.Core.Messages.WritePrepares message) [0x00000] in :0

and so on

I’ve checked the disk and it has another 200GB of space to write to, so it’s not that (unless I’m missing some aspect of EC2).

Any ideas? If you grab me on Skype I can give you access to the box.

Cheers

Rob

Is this dev or last version?

I have seen this before. My guess is you were on a SAN and the OS lost connection with it (e.g. the file handles became invalid).

Youch, thanks EC2

This is dev, not the last version - I’m a Linux whore, so I build all the things from source because I like to look cool.

I’ve restarted the event store, but it’ll take a while to verify all that data!

How about this one?

[04993,07,07:23:07.918] Error in TableIndex.ReadOffQueue

Too many open files

[04993,07,07:23:07.944] Global Unhandled Exception occurred.

Too many open files

(During ReadIndex rebuilding)

The ulimit is likely still at its default for a small instance.

http://ss64.com/bash/ulimit.html

Cheers,

Greg

Likely a ulimit is being set on file handles at the OS level: http://ss64.com/bash/ulimit.html
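(For reference, a minimal sketch of checking and raising the open-file limit on a typical Linux box; the 65536 value and the limits.conf approach are illustrative, so adjust for your distro:)

ulimit -n        # current soft limit on open files for this shell
ulimit -Hn       # hard limit
ulimit -n 65536  # raise the soft limit (up to the hard limit) in the shell you start Event Store from

# To make it persist across logins on distros using pam_limits,
# add lines like these to /etc/security/limits.conf:
#   youruser  soft  nofile  65536
#   youruser  hard  nofile  65536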

I’m not on small, I’m on xlarge

but okay, I can change that if I really need to

What do they set it to by default? Did you build out the Linux box? Normally ulimit is set quite low for file handles.

Yeah, that’s fine. If that’s all it is, it’s a common problem with Node apps too when they’re opening file handles on web requests.

Does the Event Store know how many records there are for the ReadIndex rebuild? Some indication of percentage progress would be nice - I’m up to 8 million and have no idea if that’s good or bad :slight_smile:

Or perhaps there is a percentage and I can’t see it because I’m running in screen :slight_smile:

Normally you will never see more than about 1m; it was likely trying to write down indexes and failing previously because it was out of file handles :slight_smile:
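(To confirm the process really is bumping into its handle cap, you can compare what it holds against its limit; the pgrep pattern below is an assumption, so substitute the real PID if it doesn’t match:)

ES_PID=$(pgrep -f EventStore | head -n 1)  # assumes the process name contains "EventStore"
grep "open files" /proc/$ES_PID/limits     # the limit the running process actually got
ls /proc/$ES_PID/fd | wc -l                # number of file descriptors it currently has open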

I got a “shit be corrupt yo’” message on start-up; I guess that is related to the SAN failing.

Would it be possible to detect when all of this happens and present a “yo, your cloud infrastructure is shite” message rather than dumping endless “I CAN’T WRITE TO DISK” errors? :)

We are doing better. Working on how to recover :slight_smile:

Yay, GO GO TEAM EVENT STORE

How long is this going to take?

[04993,09,08:16:28.994] Still waiting for chaser to catch up already for 00:25:33.3963690…

[04993,09,08:16:29.094] Still waiting for chaser to catch up already for 00:25:33.4967085…

[04993,09,08:16:29.195] Still waiting for chaser to catch up already for 00:25:33.5969947…

[04993,09,08:16:29.295] Still waiting for chaser to catch up already for 00:25:33.6970915…

[04993,09,08:16:29.396] Still waiting for chaser to catch up already for 00:25:33.7973499…

[04993,09,08:16:29.496] Still waiting for chaser to catch up already for 00:25:33.8975890…

[04993,09,08:16:29.597] Still waiting for chaser to catch up already for 00:25:33.9980107…

[04993,09,08:16:29.697] Still waiting for chaser to catch up already for 00:25:34.0982359…

[04993,09,08:16:29.798] Still waiting for chaser to catch up already for 00:25:34.1987616…

[04993,09,08:16:29.899] Still waiting for chaser to catch up already for 00:25:34.2997389.

:slight_smile:

Rob,

  1. Before you got all those NullRef exceptions there had to be another, more meaningful exception which will tell us more precisely what happened. The most common causes are running out of disk space or a file handle becoming invalid. If you can look it up, we’ll know better. I need to do a better job of failing immediately after the first error, but for now I just follow a “do no harm” strategy and prevent any further writes - that’s why you see all those NullRef exceptions.

  2. Regarding processing of the index on restart: as Greg said, normally you will have to process no more than 1 million events, and the console prints a message for every 100k processed (which is pretty quick). In rare cases where the failure happened during index dumping, you could end up with the whole read index corrupted - in that case we first try to recover quickly from a previous backup, otherwise we have to rebuild the index completely, which may take some time, especially if you have lots of data.

It seems you have problems with NLog’s ColoredConsoleOutput. Can you edit NLog.config to comment out the colored console target and uncomment the plain console one?

  3. Re: waiting for chaser. It seems something went wrong in the chaser queue, and there must have been some log messages saying what the error was. Could you send me the most complete logs you have?

I can send you the full logs from yesterday and today, let me just SCP it from the server

Or not, because the error logs are several hundred MB haha
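(Rather than shipping whole files, grepping the first errors out of them is usually enough; the log path below is illustrative:)

# earliest ERROR entries, i.e. before the NullRef storm started
grep -n -m 20 " ERROR " /path/to/logs/eventstore.log

# or look specifically for IO-level failures
grep -n "IOException\|Too many open files" /path/to/logs/eventstore.log | head -n 20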

We have [PID:01668:349 2013.05.07 21:50:24.370 ERROR TableIndex ] Error in TableIndex.ReadOffQueue

System.IO.IOException: Too many open files

at System.IO.FileStream..ctor (System.String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, Boolean anonymous, FileOptions options) [0x00000] in :0

at System.IO.FileStream..ctor (System.String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options) [0x00000] in :0

at (wrapper remoting-invoke-with-check) System.IO.FileStream:.ctor (string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare,int,System.IO.FileOptions)

at EventStore.Core.Index.PTable+WorkItem..ctor (System.String filename, Int32 bufferSize) [0x00000] in :0

at EventStore.Core.Index.PTable+c__Iterator2.MoveNext () [0x00000] in :0

at EventStore.Core.Index.PTable.MergeTo2 (System.Collections.Generic.List`1 enumerators, Int64 fileSize, System.String outputFile, System.Func`2 isHashCollision, Int32 cacheDepth) [0x00000] in :0

at EventStore.Core.Index.PTable.MergeTo (IList`1 tables, System.String outputFile, System.Func`2 isHashCollision, Int32 cacheDepth) [0x00000] in :0

That was last night before it all went crap - the funny thing is it wasn’t even doing anything: I wasn’t pumping any new events in, and it had finished crunching all the projections I’d given it already.

This was followed by

[PID:01668:350 2013.05.07 22:00:22.062 DEBUG MonitoringService ] Couldn’t get free mem on linux, received memory info raw string: []

System.IO.IOException: Error creating standard output pipe

at System.Diagnostics.Process.Start_noshell (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start_common (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start (System.Diagnostics.ProcessStartInfo startInfo) [0x00000] in :0

at EventStore.Core.Services.Monitoring.Utils.ShellExecutor.GetOutput (System.String command, System.String args) [0x00000] in :0

at EventStore.Core.Services.Monitoring.SystemStatsHelper.GetFreeMemOnLinux () [0x00000] in :0

[PID:01668:352 2013.05.07 22:00:52.065 DEBUG MonitoringService ] couldn’t get drive name for directory /mnt/ghevents on linux

System.IO.IOException: Error creating standard output pipe

at System.Diagnostics.Process.Start_noshell (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start_common (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start (System.Diagnostics.ProcessStartInfo startInfo) [0x00000] in :0

at EventStore.Core.Services.Monitoring.Utils.ShellExecutor.GetOutput (System.String command, System.String args) [0x00000] in :0

at EventStore.Core.Services.Monitoring.Stats.EsDriveInfo.GetDirectoryRootInLinux (System.String directory, ILogger log) [0x00000] in :0

[PID:01668:352 2013.05.07 22:00:52.066 DEBUG MonitoringService ] Couldn’t get free mem on linux, received memory info raw string: []

System.IO.IOException: Error creating standard output pipe

at System.Diagnostics.Process.Start_noshell (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start_common (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start (System.Diagnostics.ProcessStartInfo st

So yeah, the SAN disconnected - no biggie.