Corrupted chunk after reboot

Hi,
We had to reboot our storage server (with Redis and EventStore installed) in one of our environments.

After the reboot, Event Store began rebuilding the index for all 39 chunks. Unfortunately, every time it arrives at chunk 31, it fails:

```
[PID:03980:016 2015.05.29 12:41:27.894 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 174525 records (35.7%).
[PID:03980:006 2015.05.29 12:41:28.081 FATAL TFChunkDb ] Verification of chunk #31-31 (chunk-000031.000001) failed, terminating server...
EventStore.Core.Exceptions.HashValidationException: Exception of type 'EventStore.Core.Exceptions.HashValidationException' was thrown.
   at EventStore.Core.TransactionLog.Chunks.TFChunk.TFChunk.VerifyFileHash() in f:\Repos\EventStore\src\EventStore.Core\TransactionLog\Chunks\TFChunk\TFChunk.cs:line 495
   at EventStore.Core.TransactionLog.Chunks.TFChunkDb.<>c__DisplayClass1.b__0(Object _) in f:\Repos\EventStore\src\EventStore.Core\TransactionLog\Chunks\TFChunkDb.cs:line 144
[PID:03980:006 2015.05.29 12:41:28.394 ERROR Application ] Exiting with exit code: 1.
Exit reason: Verification of chunk #31-31 (chunk-000031.000001) failed, terminating server...
```

Is there anything one can do to fix this? As far as I can see, it was just a normal reboot, and we don't have write caching enabled, so I am not sure how this corruption could have occurred.

We are using the dev version from March 6th, 2015.

Thanks

Nicolas

Ah, I found something that might be interesting/related: chunk #31 was the last one that was scavenged (extension 000001); all following chunks have the extension 000000.

This is basically saying that the MD5 checksum of the chunk no longer matches the one that was originally calculated.
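
If you want to poke at the file yourself, here is a minimal hashing sketch in Python. Note that this is not the same calculation VerifyFileHash() performs internally (the stored checksum itself presumably has to be excluded from it), so it is mainly useful for comparing two copies of the same chunk file, e.g. against a copy from another node or a backup, if one exists.

```python
import hashlib
import sys

def file_md5(path, block_size=4 * 1024 * 1024):
    """Stream a file through MD5 in blocks and return the hex digest."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            md5.update(block)
    return md5.hexdigest()

# Usage: python md5_chunks.py chunk-000031.000001 [more files...]
for path in sys.argv[1:]:
    print(path, file_md5(path))
```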

I don't know of any outstanding issues related to this, but maybe someone else has seen something. @james?

You mention you are using dev. Dev has some changes that may have something to do with this, related to alignment changes happening in the TF. You mention the prior chunks are fine and were also scavenged, though?

Would it make sense to replace the binaries with 3.0.4, or rather stay on the older dev binaries until this issue is resolved?

The other chunks seem to be fine; they are all at ~227 MB (compared to the pre-scavenge 256 MB), and the log output shows only the mentioned error concerning chunk #31.

So chunks from dev are, as of now, *not* backwards compatible with 3.0.x (it will actually bring them forward). There is a change dealing with alignment. Can you give me the exact sizes of the chunk files (one that is working, a random one is fine, and the one that's not)?
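
For what it's worth, a quick way to gather those numbers (a rough Python sketch, assuming the usual chunk-NNNNNN.NNNNNN file naming and that you point it at the db directory):

```python
import glob
import os
import sys

# Point this at the EventStore db directory (defaults to the current directory).
db_dir = sys.argv[1] if len(sys.argv) > 1 else "."

for path in sorted(glob.glob(os.path.join(db_dir, "chunk-*"))):
    size = os.path.getsize(path)
    print(f"{os.path.basename(path)}  {size:>15,} bytes")
```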

- chunk-000000.000001: 127,668,224 bytes (should be working)
- chunk-000020.000001: 237,568,000 bytes (should be working)
- chunk-000031.000001: 241,045,504 bytes (mentioned in the error, so likely corrupt)
- chunk-000032.000000: 268,382,208 bytes (first after the corrupted chunk and also the first that was not scavenged)
- chunk-000039.000000: 268,435,712 bytes (last chunk)

Does that suffice?

OK so they are aligned.
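
An aside for anyone reading this later: I take "aligned" to mean the chunk file sizes land on a 4096-byte boundary after the alignment change on dev; the 4096 figure is my assumption, not something confirmed above. A trivial check against the sizes quoted in the previous message:

```python
# Sizes quoted in the previous message, in bytes.
sizes = {
    "chunk-000000.000001": 127_668_224,
    "chunk-000020.000001": 237_568_000,
    "chunk-000031.000001": 241_045_504,
    "chunk-000032.000000": 268_382_208,
    "chunk-000039.000000": 268_435_712,
}

ALIGNMENT = 4096  # assumed alignment unit, not confirmed in the thread

for name, size in sizes.items():
    remainder = size % ALIGNMENT
    status = "aligned" if remainder == 0 else f"not aligned (remainder {remainder})"
    print(f"{name}: {size:,} bytes -> {status}")

# Note: the last chunk (presumably the currently open one) comes out with a
# 256-byte remainder, which looks like a fixed chunk header/footer rather than
# anything alarming.
```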

Do you have another copy of chunk 32 sitting in a backup etc?

Unfortunately not

Is there hope in correcting this without a backup?

Possibly; it depends on how it's broken.

You can bring up the database right now just by disabling the verification at startup with --skip-db-verify (a read in that chunk, if it hits the corruption, may give an error, but it should work overall).

Beyond that, we need to figure out whether it's a bug (possible, especially off dev, as not all the changes there have been through long-term testing) or an actual corrupt chunk.

--skip-db-verify allowed starting Event Store, and as far as I can see, all events and streams seem to be there.

What would be my next steps to:

  1. "Make sure" everything is really there, correct, and consistent? (If that's even possible, since there is no frame of reference.)

  2. Find out what went wrong. Would it help if I sent you the chunks, the exact binary used, and my configuration?

  3. Migrate to the stable (3.0.4) branch? Since - like you said - the files are not directly compatible, I'd imagine some kind of replication-based migration would have to be used.

PS: And yes, I am already working on a backup script :wink:

Thanks

A backup script is a good start. If you, for instance, try to scavenge that chunk, you may get interesting behaviour (we have not fixed anything, we have just said "ignore that issue"). It could be that the data is slightly off somewhere in the chunk, or it could be just a miscalculation. I think the right place to start is to take a backup at this point (btw, if running clustered, the chunks are the same on all nodes). From there, let's bring up that db in, say, a dev environment. I would be curious to know what errors may come up on that chunk if you scavenge it (another option is to try bringing up a from-all subscription, which will try to read every event).
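
To make the "read every event" idea concrete, here is a rough sketch over the HTTP Atom feed rather than a client subscription: it pages through $all from the oldest page forward, which should force every event to be read back from the chunks. It assumes the defaults (single node on localhost:2113, admin/changeit credentials) and the standard feed link relations; treat it as a starting point, not a tested tool.

```python
import requests

# Assumptions: single node on localhost:2113 with the default admin credentials.
BASE = "http://127.0.0.1:2113"
AUTH = ("admin", "changeit")
HEADERS = {"Accept": "application/vnd.eventstore.atom+json"}

def get_feed(url):
    resp = requests.get(url, auth=AUTH, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()

def link(feed, relation):
    # The feed advertises paging links ("first", "last", "previous", "next", ...).
    for l in feed.get("links", []):
        if l.get("relation") == relation:
            return l.get("uri")
    return None

# Start at the head of $all, jump to the oldest page via "last",
# then walk forward via "previous" until a page comes back empty.
head = get_feed(BASE + "/streams/%24all")
url = link(head, "last") or BASE + "/streams/%24all"

total = 0
while url:
    feed = get_feed(url)
    entries = feed.get("entries", [])
    if not entries:
        break  # caught up with the end of the log
    total += len(entries)
    print(f"read {total} events so far")
    url = link(feed, "previous")

print(f"done, {total} events read without errors")
```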