[ERROR QueuedHandlerMRES] Error EventStore.Core.Messages.ClientMessage+ReadAllEventsForward in queued handler 'StorageReaderQueue #2'.

Bindiya · May 12, 2014, 8:25pm

I am seeing the following error in the GES log files:

[PID:05128:011 2014.05.12 00:00:06.152 ERROR QueuedHandlerMRES ] Error while processing message EventStore.Core.Messages.ClientMessage+ReadAllEventsForward in queued handler 'StorageReaderQueue #2'.

System.ArgumentException: Log record at actual pos 33507792 has too large length: 1702195501 bytes, while limit is 16777216 bytes. Something is seriously wrong in chunk 54-54 (E:\SecureData\App_Data\EventStoreDb\chunk-000054.000000).

at EventStore.Core.TransactionLog.Chunks.TFChunk.TFChunk.TFChunkReadSide.TryReadForwardInternal(ReaderWorkItem workItem, Int64 actualPosition, Int32& length, LogRecord& record) in c:\Dev\OSS\EventStore\src\EventStore\EventStore.Core\TransactionLog\Chunks\TFChunk\TFChunkReadSide.cs:line 485

at EventStore.Core.TransactionLog.Chunks.TFChunk.TFChunk.TFChunkReadSideUnscavenged.TryReadClosestForward(Int64 logicalPosition) in c:\Dev\OSS\EventStore\src\EventStore\EventStore.Core\TransactionLog\Chunks\TFChunk\TFChunkReadSide.cs:line 104

at EventStore.Core.TransactionLog.Chunks.TFChunkReader.TryReadNextInternal(Int64 position, Int32 retries) in c:\Dev\OSS\EventStore\src\EventStore\EventStore.Core\TransactionLog\Chunks\TFChunkReader.cs:line 86

at EventStore.Core.Services.Storage.ReaderIndex.ReadIndex.EventStore.Core.Services.Storage.ReaderIndex.IReadIndex.ReadAllEventsForward(TFPos pos, Int32 maxCount) in c:\Dev\OSS\EventStore\src\EventStore\EventStore.Core\Services\Storage\ReaderIndex\ReadIndex.cs:line 593

at EventStore.Core.Services.Storage.StorageReaderWorker.EventStore.Core.Bus.IHandle<EventStore.Core.Messages.ClientMessage.ReadAllEventsForward>.Handle(ReadAllEventsForward message) in c:\Dev\OSS\EventStore\src\EventStore\EventStore.Core\Services\Storage\StorageReaderWorker.cs:line 265

at EventStore.Core.Bus.MessageHandler`1.TryHandle(Message message) in c:\Dev\OSS\EventStore\src\EventStore\EventStore.Core\Bus\MessageHandler.cs:line 60

at EventStore.Core.Bus.InMemoryBus.PublishByType(Message message, Type type) in c:\Dev\OSS\EventStore\src\EventStore\EventStore.Core\Bus\InMemoryBus.cs:line 127

at EventStore.Core.Bus.InMemoryBus.DispatchByType(Message message) in c:\Dev\OSS\EventStore\src\EventStore\EventStore.Core\Bus\InMemoryBus.cs:line 105

at EventStore.Core.Bus.QueuedHandlerMRES.ReadFromQueue(Object o) in c:\Dev\OSS\EventStore\src\EventStore\EventStore.Core\Bus\QueuedHandlerMRES.cs:line 140

Any ideas on what went wrong here?

Thanks

Bindiya

jen20 · May 12, 2014, 8:28pm

How many chunks do you have?

Is there any other background information (eg just restored from a backup etc)?

Cheers,

James

Bindiya · May 12, 2014, 8:32pm

Hey James,
Thanks for the quick response. I have 54 chunks. The client just sent us their production copy of GES and the log files. I saw the exception above in the log files while investigating a defect.

Thanks

jen20 · May 12, 2014, 9:22pm

The error message suggests a corrupted chunk file.
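
One way to see why the reported length looks like corruption rather than a genuinely huge record is to inspect its bytes. The sketch below assumes the length prefix is stored as a 4-byte little-endian integer (an assumption about the on-disk framing, not something stated in this thread) and re-encodes the value from the exception as text. `length_as_ascii` is a hypothetical helper written for illustration only.

```python
import struct

# Limit quoted in the exception message: 16777216 bytes = 16 MiB.
MAX_RECORD_LENGTH = 16 * 1024 * 1024


def length_as_ascii(length: int) -> str:
    """Re-encode a 32-bit length as 4 little-endian bytes and render them as text.

    Assumes a little-endian int32 length prefix (an assumption, see above).
    Non-printable bytes are shown as '.'.
    """
    raw = struct.pack("<i", length)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


reported = 1702195501  # length taken from the exception message

print(reported > MAX_RECORD_LENGTH)  # True: far above the 16 MiB limit
print(hex(reported))                 # 0x6575712d
print(length_as_ascii(reported))     # -que
```

All four bytes are printable ASCII, which would be consistent with the reader interpreting part of an event payload as a record length, for example after landing at a position that is not a record boundary.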

To be able to do any debugging work I’ll need:

at minimum the chunk in question
the log files including the options dump printed at the start of execution
full details of the machines they were running on (including disks and controllers)

If you want to email me off list we can arrange a way to get the database files over to us.

If you have a commercial support arrangement it will be significantly faster to go through the service desk instead.

Regards,

James

Bindiya · May 13, 2014, 3:22pm

Thank you for replying. I’m trying to get the corrupt chunk and other information for you. We did notice that it has gone up to chunk 55 now, and I’m wondering whether it is safe to keep using the system with a corrupt chunk.

Please advise,

Bindiya

Greg_Young1 · May 13, 2014, 3:41pm

It’s OK, but it should be resolved. Did you have power failures, etc.? Does the chunk pass the initial validation on startup?

Bindiya · May 13, 2014, 4:17pm

No it does not. I have skipDbVerify=true in the config file. The moment we start the GES, they get an error in the log for this chunk.

Thanks,

Greg_Young1 · May 13, 2014, 4:23pm

What happens if you don’t skip? I’m curious what type of issue it may be.

Bindiya · May 13, 2014, 4:49pm

They did not have power failures, but there was a system reboot at 3 in the morning. After that the client started seeing all sorts of issues, so they restarted the GES and we saw the above ("System.ArgumentException: Log record at actual pos 33507792 has too large length: 1702195501 bytes, while limit is 16777216 bytes. Something is seriously wrong in chunk 54-").

They are using the system heavily today, and my concern is what went wrong with that chunk, since the log files are still complaining about something being seriously wrong with it.

It’s a production environment, so I’ll have to see if I can have them turn skipDbVerify off.

Greg_Young1 · May 13, 2014, 4:53pm

Copy the files and run the check elsewhere.

Is the machine that restarted Linux or Windows? Do you have disk caching enabled?
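
The suggestion to copy the files and run the check elsewhere could be sketched roughly as follows. This is only an illustration: `sha256_of` and `copy_and_verify` are hypothetical helpers, the directory arguments are placeholders, and the script only confirms that each copy is byte-identical to its source. It does not perform the database's own verification.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def sha256_of(path: Path, block_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size blocks so large chunk files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def copy_and_verify(src_dir: Path, dst_dir: Path) -> List[str]:
    """Copy every regular file from src_dir to dst_dir.

    Returns the names of files whose copy does not hash the same as the source.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    mismatched = []
    for src in sorted(p for p in src_dir.iterdir() if p.is_file()):
        dst = dst_dir / src.name
        shutil.copy2(src, dst)  # copies contents and file metadata
        if sha256_of(src) != sha256_of(dst):
            mismatched.append(src.name)
    return mismatched


# Hypothetical usage (paths are placeholders, not taken from this thread):
# bad = copy_and_verify(Path("path/to/db"), Path("path/to/copy"))
# print("all copies match" if not bad else f"mismatched: {bad}")
```

A copy taken while the server is still writing may not be a consistent snapshot, so a stopped node or an existing backup is the safer source.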

jen20 · May 13, 2014, 5:35pm

It should be noted that if your client is using the software in a mission critical system, and using it heavily, that this is exactly what commercial support is designed for.

Are you able to give us information about the disks and configuration of the controllers?

Bindiya · May 13, 2014, 6:44pm

It is a Windows server. Disk caching is enabled by default in Windows Server.

If we do not skip, the server will use every ounce of memory in the system. With the client data (55 chunks), the server holds steady at 5GB of RAM usage.

With verify on, the server pegs out at 16GB (the total RAM in the server).

I’m working on getting you the answers to the other questions.

Thanks for the help.

Bindiya · May 13, 2014, 7:46pm

This is what I’ve found out:

The server the client is using is a VMware guest. We are unaware of the hardware platform.

The Disks are VMware Virtual SCSI Disks. Write Caching is enabled on the drives.

The server has 4 vCPUs and 16GB of RAM.

The volume containing the EventStore is a 100GB space with 47GB free.

This volume contains our application, the EventStore, and MongoDb files, and Lucene indices.

The Production EventStore is 15GB in size containing 55 chunks.

The EventStore Log folder is 1.64 GB in size containing 395 files.

It will take about 4 to 5 hours to get the EventStore copied elsewhere to test with skipDbVerify=false. I will let you know when that’s done.

Thanks,

bindiya

Greg_Young1 · May 13, 2014, 9:15pm

You run Windows Server with disk caching enabled on a production database? Enabling disk caching basically says to ignore us when we say that something should actually be on disk.

"If we do not skip, the server will use every ounce of memory in the system. with client data (55) chunks, server holding steady at 5GB of RAM usage.

With verify on, server pegs out at 16GB (total RAM in server)."

This is the Windows file cache, and you can control it (though ES doesn’t do this any more). A quick search will lead you to many articles, for example http://superuser.com/questions/422113/how-could-i-limit-or-even-disable-file-cache-on-windows-server-2008r2

Greg

Bindiya · May 14, 2014, 2:03pm

Please let me know how I can send the chunk offline.

Thanks,

Bindiya

Greg_Young1 · May 15, 2014, 5:00pm

James will respond shortly with an FTP account.

"The server the client is using is a VMware guest. We are unaware of the hardware platform.

The Disks are VMware Virtual SCSI Disks. Write Caching is enabled on the drives.

The server has 4 vCPUs and 16GB of RAM."

My guess is that with disk caching enabled you have run into data corruption as a result of our operations being ignored somewhere in the stack (enabling caching in Windows does much more than just enabling Windows caching). In these cases a non-clean shutdown, power outage, etc. can cause data corruption. I am also unsure what handling or options may be set up in VMware (my guess is that there are some settings for dealing with the durability of disk writes). VM software normally prefers to be fast rather than reliable by default, and needs some settings enabled on it.
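
To illustrate what "operations being ignored somewhere in the stack" means from an application's point of view, here is a minimal sketch of a durable append. It is not Event Store's code; `durable_append` is a hypothetical helper. The application flushes its own buffer and then asks the operating system to flush its cache, but it has no way to confirm that a disk controller, a hypervisor, or a drive with write caching enabled has really persisted the bytes.

```python
import os


def durable_append(path: str, payload: bytes) -> None:
    """Append payload to path and request that it reaches stable storage."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()             # push Python's user-space buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush its cache to the device
        # Any layer below the OS (virtual disk, controller, drive cache) may
        # still hold the data in volatile memory after fsync returns. That is
        # the window in which an unclean restart can leave a torn write.


# Hypothetical usage:
# durable_append("example.log", b"record-1\n")
```

If a lower layer acknowledges the flush without persisting the data, the writer believes the record is safe when it is not, and a reboot at that moment can leave a partially written file.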

Cheers,

Greg

Bindiya · May 15, 2014, 5:07pm

Thank you so much Greg!

Is there a way to fix the chunk?

Thanks again,

Bindiya

Greg_Young1 · May 15, 2014, 5:10pm

Once I have them I can look into them and validate what I think likely happened. I won’t know if this is what happened or if it can be resolved (and if so how much data loss there possibly is) until I have received the chunk and had a day or two to investigate what is there.

In the future it’s imperative that you understand what caching sits between Windows and the disks (especially with things like VMware in the middle, as they likely have their own settings).

Greg_Young1 · May 15, 2014, 5:44pm

What version of VMware are you running? If it runs on a host OS as opposed to ESX etc., what OS is the host?

jen20 · May 15, 2014, 6:00pm

Bindiya,

Can you send over a link to the chunk using something like SendSpace?

Thanks,

James