Scavenge error: "Could not read latest stream's prepare"

Hi,

Every node in our Event Store cluster throws the following error when attempting to scavenge:

[PID:32164:081 2016.08.01 14:28:25.563 INFO AdminController ] Request scavenging because /admin/scavenge request has been received.

[PID:32164:076 2016.08.01 14:28:25.563 TRACE TFChunkScavenger ] SCAVENGING: started scavenging of DB. Chunks count at start: 178. Options: alwaysKeepScavenged = False, mergeChunks = True

[PID:32164:076 2016.08.01 14:28:25.564 TRACE TFChunkScavenger ] SCAVENGING: started to scavenge & merge chunks: #0-23 (chunk-000000.000003)

Resulting temp chunk file: 347cd0de-e731-4b3b-97d1-7aa52802aceb.scavenge.tmp.

[PID:32164:076 2016.08.01 14:28:47.767 ERROR StorageScavenger ] SCAVENGING: error while scavenging DB.

System.Exception: Could not read latest stream’s prepare. That should never happen.

at EventStore.Core.Services.Storage.ReaderIndex.IndexReader.GetStreamLastEventNumberUncached (TFReaderLease reader, System.String streamId) [0x00000] in :0

at EventStore.Core.Services.Storage.ReaderIndex.IndexReader.GetStreamLastEventNumberCached (TFReaderLease reader, System.String streamId) [0x00000] in :0

at EventStore.Core.Services.Storage.ReaderIndex.IndexReader.EventStore.Core.Services.Storage.ReaderIndex.IIndexReader.GetStreamLastEventNumber (System.String streamId) [0x00000] in :0

at EventStore.Core.Services.Storage.ReaderIndex.ReadIndex.EventStore.Core.Services.Storage.ReaderIndex.IReadIndex.GetStreamLastEventNumber (System.String streamId) [0x00000] in :0

at EventStore.Core.TransactionLog.Chunks.TFChunkScavenger.ShouldKeepPrepare (EventStore.Core.TransactionLog.LogRecords.PrepareLogRecord prepare, System.Collections.Generic.Dictionary`2 commits, Int64 chunkStart, Int64 chunkEnd) [0x00000] in :0

at EventStore.Core.TransactionLog.Chunks.TFChunkScavenger+c__AnonStorey1.<>m__1 (EventStore.Core.TransactionLog.LogRecords.PrepareLogRecord prepare) [0x00000] in :0

at EventStore.Core.TransactionLog.Chunks.TFChunkScavenger.TraverseChunk (EventStore.Core.TransactionLog.Chunks.TFChunk.TFChunk chunk, System.Action`1 processPrepare, System.Action`1 processCommit, System.Action`1 processSystem) [0x00000] in :0

at EventStore.Core.TransactionLog.Chunks.TFChunkScavenger.ScavengeChunks (Boolean alwaysKeepScavenged, IList`1 oldChunks, System.Int64& spaceSaved) [0x00000] in :0

at EventStore.Core.TransactionLog.Chunks.TFChunkScavenger.Scavenge (Boolean alwaysKeepScavenged, Boolean mergeChunks) [0x00000] in :0

at EventStore.Core.Services.Storage.StorageScavenger.Scavenge (EventStore.Core.Messages.ScavengeDatabase message) [0x00000] in :0

This error occurs on old 3.0.5 nodes AND on new 3.8.1 nodes (all nodes run on CentOS el7).

Note: the new nodes have data files copied from the old nodes, since "catching up" fails after some time.

What would you suggest as a course of action to fix this scavenge error? We would like to recover some 30+ GB of hard drive space.
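For reference, the scavenge is triggered with a plain POST to the /admin/scavenge endpoint, along the lines of this minimal Python sketch (node address, port 2113 and admin credentials are placeholders for whatever the deployment uses):

# Minimal sketch: trigger a scavenge over the admin HTTP endpoint.
# Port 2113 and the credentials are placeholders -- use your own.
import requests

NODE = "http://127.0.0.1:2113"

resp = requests.post(NODE + "/admin/scavenge",
                     auth=("admin", "changeit"),   # placeholder credentials
                     timeout=30)
resp.raise_for_status()
print("Scavenge requested, HTTP", resp.status_code)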

Have you previously had crashes and done partial restores etc?

How big are these nodes?

"Note: new nodes have data files copied from old nodes since "caching
up" fails after some time"

I have no idea what this means, but I am guessing you mean replicating? Nobody has ever reported an issue with a node catching up in the past (even for multi-TB databases).

I believe crashes tend to happen from time to time, because every now and then we find one of the nodes in the PreReplica state. Just restarting the node never helped, and we eventually gave up looking for a proper solution and would simply wipe the data and let the node catch up again. This time, however, even catching up does not help (with --gossip-timeout-ms=5000). Based on my personal observations, this may have something to do with master elections kicking in while the prereplica is catching up. Sometimes epochs start to spin up fairly fast (mostly 3+ epochs per ~5 s refresh).
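For what it is worth, a simple way to keep an eye on member states and epoch churn while a node is catching up is to poll the gossip endpoint, roughly like this throwaway Python sketch (the port and the exact JSON field names in the gossip response are assumptions on my part and may differ between versions):

# Throwaway sketch: poll the gossip endpoint and print each member's
# state and epoch, to spot PreReplica churn and fast epoch turnover.
# Port 2113 and the field names ("members", "state", "epochNumber",
# "externalHttpIp") are assumptions and may differ between versions.
import time
import requests

NODE = "http://127.0.0.1:2113"

while True:
    gossip = requests.get(NODE + "/gossip", timeout=5).json()
    for member in gossip.get("members", []):
        print(member.get("externalHttpIp"),
              member.get("state"),
              "epoch", member.get("epochNumber"))
    print("---")
    time.sleep(5)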

Partial restores have never been done; however, the new nodes (3.8.1) have the database folder copied wholesale from an older node (3.0.5), since catching up did not work.

Chunk verification on node start seems to complete successfully (no errors in log file).

As for node size:

  • 37 GB Data folder
  • 153 chunk files most of which are 256-257 MB in size

"I have no idea what this means, but I am guessing you mean replicating? Nobody has ever reported an issue with a node catching up in the past (even for multi-TB databases)."

Sorry. Missed this part. Yes - I mean replication.

We mentioned this catching-up problem once before: https://groups.google.com/forum/#!topic/event-store/g_Dxx8z4f7g . At that time, setting --gossip-timeout-ms=5000 solved the problem, since that was a crappy environment with network issues.

Anyway, as for the scavenge problem: today I tried a different setup for catching up. I made a new node (with data files copied from 3.0.5, but with node version 3.8.1 installed) the master and then let another 3.8.1 node replicate from it. This time replication appeared to be resilient, because it kept replicating data even when the master node restarted and another node got elected as the new master.

However, even after replication completed, I got the following scavenge output:

Status                             | Space Saved (bytes) | Time Taken       | Result
Completed                          | 0                   | 00:10:04.2029699 | Failed: Error while scavenging DB: Could not read latest stream’s prepare. That should never happen…
Scavenging chunks 26 - 27 complete | 9,717,884           | 00:02:16.4805053 | 2 chunk(s) scavenged
Scavenging chunks 25 - 25 complete | 0                   | 00:01:51.6999932 | No chunks scavenged
Scavenging chunks 24 - 24 complete | 0                   | 00:01:57.7795314 | No chunks scavenged
Scavenging chunks 0 - 23 complete  | 507                 | 00:03:07.2709727 | 24 chunk(s) scavenged
Started                            |                     |                  |

Logs state the same message as pasted in the first message of this thread.

I tried this replication hoping that maybe the database files would get fixed up in the process.

It may be worth mentioning that I had deleted roughly 2.3 million streams before trying the scavenge last week.
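For context, deleting a stream is just a DELETE against its /streams resource over HTTP; below is a minimal sketch of what a bulk delete along those lines looks like (stream names, node address and credentials are placeholders; the ES-HardDelete header controls whether a delete is soft or hard):

# Sketch of a bulk soft delete over HTTP. Node address, credentials and
# stream names are placeholders. Sending an "ES-HardDelete: true" header
# would hard-delete instead of soft-delete.
import requests

NODE = "http://127.0.0.1:2113"
AUTH = ("admin", "changeit")   # placeholder credentials

def soft_delete(stream_name):
    resp = requests.delete(NODE + "/streams/" + stream_name,
                           auth=AUTH, timeout=30)
    resp.raise_for_status()

# Hypothetical example: delete a batch of per-entity streams.
for i in range(10):
    soft_delete("example-entity-%d" % i)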

Would love to hear any ideas/suggestions on what I could try next.

Simonas @ Adform

See #1006, to be released in 3.9.0 this week. It will resolve the immediate issue; then we can look at what might be the root cause.

The download link for 3.9.0 appears to be broken. All I get is a “File Not Found” error page. The 3.8.1 download does work, though.

We are aware of the problem and are looking into it; thank you for reporting the issue, though.

The issue has been resolved, please let us know if you have any further issues.

Hi Greg,

Today I upgraded one node to 3.9.0 and ran the scavenge process on it. The result is a ~19.4 GB disk space gain, so scavenging does seem to mostly work. It did, however, post 27 entries of “Error: Could not read latest stream’s prepare. That should never happen.” into the scavenge stream. The log contains exactly the same exception stack trace as the one in the original post.
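In case it is useful, one way to count those entries is to page the scavenge stream backwards over the Atom API, roughly like this sketch (the stream name, port, credentials and JSON field names are assumptions/placeholders):

# Sketch: page the scavenge stream backwards via the Atom API and count
# entries containing the error text. Stream name, port, credentials and
# field names ("entries", "data") are assumptions/placeholders.
import requests

NODE = "http://127.0.0.1:2113"
SCAVENGE_STREAM = "$scavenges"   # placeholder stream name

feed = requests.get(
    NODE + "/streams/" + SCAVENGE_STREAM + "/head/backward/100",
    params={"embed": "body"},
    headers={"Accept": "application/vnd.eventstore.atom+json"},
    auth=("admin", "changeit"),   # placeholder credentials
    timeout=30,
).json()

errors = [e for e in feed.get("entries", [])
          if "Could not read latest stream" in str(e.get("data", ""))]
print(len(errors), "matching entries in the last 100 events")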

Is there any information you need me to provide to further investigate this issue? Do you need the actual chunk file(s)? If so - where do I send a link?

Thanks for your help so far.

Was there anything else in the log?

There does not appear to be. Here is a snippet of the log (including the last log entry for the previous chunk and the first entry for the next chunk), just in case you see something I do not:

[PID:12062:018 2016.08.29 10:57:26.475 TRACE TFChunkScavenger ] Scavenging of chunks:

#162-162 (chunk-000162.000000)

completed in 00:00:29.3533228.

New chunk: 48eed75f-0171-49b2-b320-940da4656109.scavenge.tmp --> #162-162 (chunk-000162.000001).

Old chunks total size: 268420871, scavenged chunk size: 5772997.

[PID:12062:018 2016.08.29 10:57:26.476 TRACE TFChunkScavenger ] SCAVENGING: started to scavenge & merge chunks: #163-163 (chunk-000163.000000)

Resulting temp chunk file: f853664a-170d-4f0a-b4d5-608585d9791f.scavenge.tmp.

[PID:12062:018 2016.08.29 10:57:57.917 INFO TFChunkScavenger ] Got exception while scavenging chunk: #163-163. This chunk will be skipped

Exception: System.Exception: Could not read latest stream’s prepare. That should never happen.

at EventStore.Core.Services.Storage.ReaderIndex.IndexReader.GetStreamLastEventNumberUncached (TFReaderLease reader, System.String streamId) [0x00000] in :0

at EventStore.Core.Services.Storage.ReaderIndex.IndexReader.GetStreamLastEventNumberCached (TFReaderLease reader, System.String streamId) [0x00000] in :0

at EventStore.Core.Services.Storage.ReaderIndex.IndexReader.GetStreamMetadataUncached (TFReaderLease reader, System.String streamId) [0x00000] in :0

at EventStore.Core.Services.Storage.ReaderIndex.IndexReader.GetStreamMetadataCached (TFReaderLease reader, System.String streamId) [0x00000] in :0

at EventStore.Core.Services.Storage.ReaderIndex.IndexReader.EventStore.Core.Services.Storage.ReaderIndex.IIndexReader.GetStreamMetadata (System.String streamId) [0x00000] in :0

at EventStore.Core.Services.Storage.ReaderIndex.ReadIndex.EventStore.Core.Services.Storage.ReaderIndex.IReadIndex.GetStreamMetadata (System.String streamId) [0x00000] in :0

at EventStore.Core.TransactionLog.Chunks.TFChunkScavenger.IsSoftDeletedTempStreamWithinSameChunk (System.String eventStreamId, Int64 chunkStart, Int64 chunkEnd) [0x00000] in :0

at EventStore.Core.TransactionLog.Chunks.TFChunkScavenger.ShouldKeepPrepare (EventStore.Core.TransactionLog.LogRecords.PrepareLogRecord prepare, System.Collections.Generic.Dictionary`2 commits, Int64 chunkStart, Int64 chunkEnd) [0x00000] in :0

at EventStore.Core.TransactionLog.Chunks.TFChunkScavenger+c__AnonStorey1.<>m__1 (EventStore.Core.TransactionLog.LogRecords.PrepareLogRecord prepare) [0x00000] in :0

at EventStore.Core.TransactionLog.Chunks.TFChunkScavenger.TraverseChunk (EventStore.Core.TransactionLog.Chunks.TFChunk.TFChunk chunk, System.Action`1 processPrepare, System.Action`1 processCommit, System.Action`1 processSystem) [0x00000] in :0

at EventStore.Core.TransactionLog.Chunks.TFChunkScavenger.ScavengeChunks (Boolean alwaysKeepScavenged, IList`1 oldChunks, System.Int64& spaceSaved) [0x00000] in :0 .

[PID:12062:018 2016.08.29 10:57:57.919 TRACE TFChunkScavenger ] SCAVENGING: started to scavenge & merge chunks: #164-164 (chunk-000164.000000)

Resulting temp chunk file: 97f5149f-8eb8-482c-a91c-7565c3a180d8.scavenge.tmp.

I think it would be easier to debug and root-cause the issue with access to a backup of the db, but the db seems a bit large. Did you have a plan in mind as to how you were going to share your db with us?
Would you be in a position to run a development build of a node against a backup of your db?

Did you have a plan in mind as to how you were going to share your db with us?

I was thinking about uploading one chunk to my Google Drive account and sharing a link to it. The whole database (22 GB data folder) will not fit. If I had to share the whole thing, I could temporarily upload it to an FTP server instead.

Would you be in a position to run a development build of a node against a backup of your db?

I could do that, no problem. I might be a bit slow with updates, though, as this is not my primary task at the moment. This has happened in our Development environment and is not affecting the Prod environment at this time. We would, however, like to know what caused it and how to fix it in case it happens in Prod.

Just FYI: the full database, 7-Zipped, takes 3.7 GB of disk space.