A dead event store

I’ve praised the stability in the past, but I guess if you’re going to throw 25GB of events at a store and then use projections to generate another 50GB of emitted events, then at some point something is going to go wrong…

The question is, is it something I did? :slight_smile: (I’m on latest dev, btw.) I’m going to guess that a null ref exception isn’t something I should be able to create under ordinary circumstances :slight_smile:

System.NullReferenceException: Object reference not set to an instance of an object

at EventStore.Core.TransactionLog.Chunks.TFChunkWriter.Write (EventStore.Core.TransactionLog.LogRecords.LogRecord record, System.Int64& newPos) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.WritePrepareWithRetry (EventStore.Core.TransactionLog.LogRecords.PrepareLogRecord prepare) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.EventStore.Core.Bus.IHandle<EventStore.Core.Messages.StorageMessage.WritePrepares>.Handle (EventStore.Core.Messages.WritePrepares message) [0x00000] in :0

[PID:01668:007 2013.05.08 00:00:00.125 ERROR QueuedHandlerAutoRes] Error while processing message EventStore.Core.Messages.StorageMessage+WritePrepares in queued handler ‘StorageWriterQueue’.

System.NullReferenceException: Object reference not set to an instance of an object

at EventStore.Core.TransactionLog.Chunks.TFChunkWriter.Write (EventStore.Core.TransactionLog.LogRecords.LogRecord record, System.Int64& newPos) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.WritePrepareWithRetry (EventStore.Core.TransactionLog.LogRecords.PrepareLogRecord prepare) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.EventStore.Core.Bus.IHandle<EventStore.Core.Messages.StorageMessage.WritePrepares>.Handle (EventStore.Core.Messages.WritePrepares message) [0x00000] in :0

[PID:01668:007 2013.05.08 00:00:00.125 ERROR StorageWriterService] Exception in writer.

System.NullReferenceException: Object reference not set to an instance of an object

at EventStore.Core.TransactionLog.Chunks.TFChunkWriter.Write (EventStore.Core.TransactionLog.LogRecords.LogRecord record, System.Int64& newPos) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.WritePrepareWithRetry (EventStore.Core.TransactionLog.LogRecords.PrepareLogRecord prepare) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageWriterService.EventStore.Core.Bus.IHandle<EventStore.Core.Messages.StorageMessage.WritePrepares>.Handle (EventStore.Core.Messages.WritePrepares message) [0x00000] in :0

and so on

I’ve checked the disk and it has another 200GB of space to write to, so it’s not that (unless I’m missing some aspect of EC2).

Any ideas? If you grab me on Skype I can give you access to the box.

Cheers

Rob

Is this dev or last version?

I have seen this before. My guess is you were on a SAN and the OS lost connection with it (e.g. the file handles became invalid).

Youch, thanks EC2

This is dev, not the last version - I’m a Linux whore, so I build all the things from source because I like to look cool.

I’ve restarted the event store, but it’ll take a while to verify all that data!

How about this one?

[04993,07,07:23:07.918] Error in TableIndex.ReadOffQueue

Too many open files

[04993,07,07:23:07.944] Global Unhandled Exception occurred.

Too many open files

(During ReadIndex rebuilding)

The ulimit is likely still at its default for a small instance.

http://ss64.com/bash/ulimit.html

Cheers,

Greg

Likely a ulimit is being set on file handles at the OS level: http://ss64.com/bash/ulimit.html
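(For reference, a minimal sketch of checking and raising the open-file limit on a typical Linux box; the 65536 value and the limits.conf approach are illustrative, so adjust for your distro:)

ulimit -n        # current soft limit on open files for this shell
ulimit -Hn       # hard limit
ulimit -n 65536  # raise the soft limit (up to the hard limit) in the shell you start Event Store from

# To make it persist across logins on distros using pam_limits,
# add lines like these to /etc/security/limits.conf:
#   youruser  soft  nofile  65536
#   youruser  hard  nofile  65536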

I’m not on small, I’m on xlarge

but okay, I can change that if I really need to

What do they set it to by default? Did you build out the Linux box? Normally ulimit is set quite low for file handles.

Yeah, that’s fine. If that’s all it is, it’s a common problem with Node apps too when they’re opening file handles on web requests.

Does the Event Store know how many records there are for the ReadIndex rebuild? Some indication of percentage progress would be nice - I’m up to 8 million and have no idea if that’s good or bad :slight_smile:

Or perhaps there is a percentage and I can’t see it because I’m running in screen :slight_smile:

Normally you will never see more than about 1m; it was likely trying to write down indexes and failing previously because it was out of file handles :slight_smile:
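(To confirm the process really is bumping into its handle cap, you can compare what it holds against its limit; the pgrep pattern below is an assumption, so substitute the real PID if it doesn’t match:)

ES_PID=$(pgrep -f EventStore | head -n 1)  # assumes the process name contains "EventStore"
grep "open files" /proc/$ES_PID/limits     # the limit the running process actually got
ls /proc/$ES_PID/fd | wc -l                # number of file descriptors it currently has open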

I got a “shit be corrupt yo’” message on start-up; I guess that is related to the SAN failing.

Would it be possible to detect when all of this happens and present a “yo, your cloud infrastructure is shite” message rather than dumping endless “I CAN’T WRITE TO DISK” errors? :)

We are doing better. Working on how to recover :slight_smile:

Yay, GO GO TEAM EVENT STORE

How long is this going to take?

[04993,09,08:16:28.994] Still waiting for chaser to catch up already for 00:25:33.3963690…

[04993,09,08:16:29.094] Still waiting for chaser to catch up already for 00:25:33.4967085…

[04993,09,08:16:29.195] Still waiting for chaser to catch up already for 00:25:33.5969947…

[04993,09,08:16:29.295] Still waiting for chaser to catch up already for 00:25:33.6970915…

[04993,09,08:16:29.396] Still waiting for chaser to catch up already for 00:25:33.7973499…

[04993,09,08:16:29.496] Still waiting for chaser to catch up already for 00:25:33.8975890…

[04993,09,08:16:29.597] Still waiting for chaser to catch up already for 00:25:33.9980107…

[04993,09,08:16:29.697] Still waiting for chaser to catch up already for 00:25:34.0982359…

[04993,09,08:16:29.798] Still waiting for chaser to catch up already for 00:25:34.1987616…

[04993,09,08:16:29.899] Still waiting for chaser to catch up already for 00:25:34.2997389.

:slight_smile:

Rob,

  1. Before you got all those NullRef exceptions there had to be another, more meaningful exception which will tell us more precisely what happened. The most common causes are running out of disk space or a file handle becoming invalid. If you can look it up, we’ll know better. I need to do a better job of failing immediately after the first error, but for now I just follow a “do no harm” strategy and prevent any further writes - that’s why you see all those NullRef exceptions.

  2. Regarding processing of the index on restart: as Greg said, normally you will have to process no more than 1 million events, and the console prints a message for every 100k processed (which is pretty quick). In rare cases where the failure happened during index dumping, you could end up with the whole read index corrupted - in that case we first try to recover quickly from a previous backup, otherwise we have to rebuild the index completely, which may take some time, especially if you have lots of data.

It seems you have problems with NLog’s ColoredConsoleOutput. Can you edit NLog.config to comment out the colored console target and uncomment the plain console one?

  3. Re: waiting for chaser. It seems something went wrong in the chaser queue, and there must have been some log messages saying what the error was. Could you send me the most complete logs you have?

I can send you the full logs from yesterday and today, let me just SCP it from the server

Or not, because the error logs are several hundred MB haha
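(Rather than shipping whole files, grepping the first errors out of them is usually enough; the log path below is illustrative:)

# earliest ERROR entries, i.e. before the NullRef storm started
grep -n -m 20 " ERROR " /path/to/logs/eventstore.log

# or look specifically for IO-level failures
grep -n "IOException\|Too many open files" /path/to/logs/eventstore.log | head -n 20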

We have [PID:01668:349 2013.05.07 21:50:24.370 ERROR TableIndex ] Error in TableIndex.ReadOffQueue

System.IO.IOException: Too many open files

at System.IO.FileStream..ctor (System.String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, Boolean anonymous, FileOptions options) [0x00000] in :0

at System.IO.FileStream..ctor (System.String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options) [0x00000] in :0

at (wrapper remoting-invoke-with-check) System.IO.FileStream:.ctor (string,System.IO.FileMode,System.IO.FileAccess,System.IO.FileShare,int,System.IO.FileOptions)

at EventStore.Core.Index.PTable+WorkItem..ctor (System.String filename, Int32 bufferSize) [0x00000] in :0

at EventStore.Core.Index.PTable+c__Iterator2.MoveNext () [0x00000] in :0

at EventStore.Core.Index.PTable.MergeTo2 (System.Collections.Generic.List`1 enumerators, Int64 fileSize, System.String outputFile, System.Func`2 isHashCollision, Int32 cacheDepth) [0x00000] in :0

at EventStore.Core.Index.PTable.MergeTo (IList`1 tables, System.String outputFile, System.Func`2 isHashCollision, Int32 cacheDepth) [0x00000] in :0

That was last night before it all went crap - the funny thing is it wasn’t even doing anything: I wasn’t pumping any new events in, and it had finished crunching all the projections I’d given it already.

This was followed by

[PID:01668:350 2013.05.07 22:00:22.062 DEBUG MonitoringService ] Couldn’t get free mem on linux, received memory info raw string: []

System.IO.IOException: Error creating standard output pipe

at System.Diagnostics.Process.Start_noshell (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start_common (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start (System.Diagnostics.ProcessStartInfo startInfo) [0x00000] in :0

at EventStore.Core.Services.Monitoring.Utils.ShellExecutor.GetOutput (System.String command, System.String args) [0x00000] in :0

at EventStore.Core.Services.Monitoring.SystemStatsHelper.GetFreeMemOnLinux () [0x00000] in :0

[PID:01668:352 2013.05.07 22:00:52.065 DEBUG MonitoringService ] couldn’t get drive name for directory /mnt/ghevents on linux

System.IO.IOException: Error creating standard output pipe

at System.Diagnostics.Process.Start_noshell (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start_common (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start (System.Diagnostics.ProcessStartInfo startInfo) [0x00000] in :0

at EventStore.Core.Services.Monitoring.Utils.ShellExecutor.GetOutput (System.String command, System.String args) [0x00000] in :0

at EventStore.Core.Services.Monitoring.Stats.EsDriveInfo.GetDirectoryRootInLinux (System.String directory, ILogger log) [0x00000] in :0

[PID:01668:352 2013.05.07 22:00:52.066 DEBUG MonitoringService ] Couldn’t get free mem on linux, received memory info raw string: []

System.IO.IOException: Error creating standard output pipe

at System.Diagnostics.Process.Start_noshell (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start_common (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in :0

at System.Diagnostics.Process.Start (System.Diagnostics.ProcessStartInfo st

So yeah, the SAN disconnected - no biggie.