Can EventStoreDB recover from an out-of-disk-space issue?

fipil · January 14, 2022, 11:55am

Hello,
We use EventStore v20.10 in our test development environment. We run a single instance on a Linux VM, in Docker, using the image eventstore:20.10.0-buster-slim from the Docker Hub registry. It runs as a single node, no cluster.
Yesterday the VM ran out of disk space, and the condition persisted for some time. We freed up some space on the disk, but the EventStore database appears to be corrupted after the incident. It writes the following errors to the log:

[ 1,15,09:26:43.353,DBG] ReadIndex Rebuilding: processed 600000 records (91.4%).
[ 1,15,09:26:43.636,FTL] Error in StorageChaser. Terminating...
System.Exception: Prefix/suffix length inconsistency: prefix length(6049) != suffix length (0).
Actual pre-position: 200271223. Something is seriously wrong in chunk #23-23 (chunk-000023.000000).
   at EventStore.Core.TransactionLog.Chunks.TFChunk.TFChunk.TFChunkReadSide.TryReadForwardInternal(ReaderWorkItem workItem, Int64 actualPosition, Int32& length, LogRecord& record) in /build/src/EventStore.Core/TransactionLog/Chunks/TFChunk/TFChunkReadSide.cs:line 510
   at EventStore.Core.TransactionLog.Chunks.TFChunk.TFChunk.TFChunkReadSideUnscavenged.TryReadClosestForward(Int64 logicalPosition) in /build/src/EventStore.Core/TransactionLog/Chunks/TFChunk/TFChunkReadSide.cs:line 82
   at EventStore.Core.TransactionLog.Chunks.TFChunkReader.TryReadNextInternal(Int32 retries) in /build/src/EventStore.Core/TransactionLog/Chunks/TFChunkReader.cs:line 82
   at EventStore.Core.Services.Storage.ReaderIndex.IndexCommitter.Init(Int64 buildToPosition) in /build/src/EventStore.Core/Services/Storage/ReaderIndex/IndexCommitter.cs:line 130
   at EventStore.Core.Services.Storage.IndexCommitterService.Init(Int64 chaserCheckpoint) in /build/src/EventStore.Core/Services/Storage/IndexCommitterService.cs:line 89
   at EventStore.Core.Services.Storage.StorageChaser.ChaseTransactionLog() in /build/src/EventStore.Core/Services/Storage/StorageChaser.cs:line 107
[ 1,15,09:26:43.637,ERR] Exiting with exit code: 1.
Exit reason: "Error in StorageChaser. Terminating...\nError: Prefix/suffix length inconsistency: prefix length(6049) != suffix length (0).\nActual pre-position: 200271223. Something is seriously wrong in chunk #23-23 (chunk-000023.000000)."
[ 1, 9,09:26:43.654,WRN] Error occurred while releasing lock.
System.ApplicationException: Object synchronization method was called from an unsynchronized block of code.
   at System.Threading.Mutex.ReleaseMutex()
   at EventStore.Core.ExclusiveDbLock.Release() in /build/src/EventStore.Core/ExclusiveDbLock.cs:line 52
[ 1,12,09:26:43.668,INF] ========== ["0.0.0.0:2113"] IS SHUTTING DOWN...
[ 1,12,09:26:43.702,INF] ========== ["0.0.0.0:2113"] Service '"StorageWriter"' has shut down.
[ 1,12,09:26:43.702,INF] ========== ["0.0.0.0:2113"] Service '"StorageReader"' has shut down.
[ 1,12,09:26:43.702,INF] ========== ["0.0.0.0:2113"] Service '"HttpServer [0.0.0.0:2113]"' has shut down.
[ 1,19,09:26:43.705,DBG] Persistent subscriptions received state change to ShuttingDown. Stopping listening
[ 1,12,09:26:43.714,INF] ========== ["0.0.0.0:2113"] Service '"ReplicationTrackingService"' has shut down.
[ 1,12,09:26:43.741,INF] ========== ["0.0.0.0:2113"] Service '"Storage Chaser"' has shut down.
[ 1,12,09:26:43.741,INF] ========== ["0.0.0.0:2113"] All Services Shutdown.
[ 1,12,09:26:43.785,INF] ========== ["0.0.0.0:2113"] IS SHUT DOWN.

We already saw this error a couple of months ago, probably in a similar situation (out of disk space).
I have the following questions:

Can we recover the database from the “System.Exception: Prefix/suffix length inconsistency” error?
Does EventStore really not survive running out of disk space, so that it can corrupt a database into an unrecoverable state?
How can we prevent this situation? Does a multi-node cluster deployment prevent it? (Of course we should avoid running out of disk space, but do you have any other advice about deployment scenarios, etc.?)
Any other advice, please?

So far we only run EventStore in a test environment, so this is not critical, but I can’t imagine something like this happening to us in production.

Thank you for any help.
Filip Nowak

henri · March 5, 2023, 4:43pm

Did you find a solution for this?

Does no one have a solution?

hayley.campbell · March 6, 2023, 9:50am

Hi Henri,

The issue that causes the corruption has been fixed in 22.10.1, so we recommend that you upgrade to this version in order to prevent it from happening.

As for fixing a database that has run into this, the corrupted data at the end of the chunk was never acked back to the client, so it can be safely removed. For more information, you can check the issue on GitHub: https://github.com/EventStore/EventStore/issues/3642

If you are running in a cluster, we recommend restoring a backup of another node to the one that ran out of space.

If you are running a single node, then you need to truncate the corrupted data. You can do this by copying the chaser.chk file over the truncate.chk file (as if you were restoring a backup), and then starting the node.
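
The single-node step could be scripted roughly as follows. This is only a minimal sketch, not an official tool: it assumes the node is stopped, that chaser.chk and truncate.chk both live directly in the database directory you pass in, and the function name and the truncate.chk.bak backup file are made up for illustration. Taking a copy of the whole database directory before trying it would be a sensible precaution.

```python
import shutil
from pathlib import Path


def truncate_to_chaser(db_dir: str) -> Path:
    """Copy chaser.chk over truncate.chk inside db_dir.

    Hypothetical helper: run it only while the node is stopped.
    The previous truncate.chk (if any) is kept as truncate.chk.bak,
    a backup name chosen here purely for illustration.
    Returns the path of the rewritten truncate.chk.
    """
    db = Path(db_dir)
    chaser = db / "chaser.chk"
    truncate = db / "truncate.chk"

    if not chaser.is_file():
        raise FileNotFoundError(f"chaser.chk not found in {db}")

    # Keep the old truncate checkpoint so the step can be undone.
    if truncate.exists():
        shutil.copy2(truncate, db / "truncate.chk.bak")

    # The actual fix described above: chaser.chk -> truncate.chk.
    shutil.copy2(chaser, truncate)
    return truncate
```

You would call it with the data directory mounted into the container, for example truncate_to_chaser("/path/to/eventstore/db") (a placeholder path), and then start the node as described above.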