Corrupt database detected error

We have a 3-node Event Store setup (each node on a different machine), and the data directory on each machine is over 1 TB. We recently ran into a corrupt database error: 2 of our nodes stopped with the following exception and kept throwing it on every restart.

Unhandled exception while starting application:

EventStore.Core.Exceptions.CorruptDatabaseException: Corrupt database detected. ---> EventStore.Core.Exceptions.ChunkNotFoundException: E:\EventStore\data-EventStoreNode1\chunk-000000.000000 not found.

--- End of inner exception stack trace ---

We had a scavenge running on this cluster, and I can tell from the chunk file suffixes (.000001) that it completed on the healthy node but could not finish on the failed nodes (lots of {guid}.scavenge.tmp files). Right before those exceptions started, I can see other exceptions in the logs: “Couldn’t acquire exclusive lock on DB at ‘E:\EventStore\data-EventStoreNode1’.” on the 1st node and “there is not enough disk space on disk” on the 2nd node.
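For anyone comparing a healthy node with a failed one, a small read-only script can summarize which chunk versions and leftover scavenge temp files are present. This is only a sketch: the naming pattern is inferred from the file names quoted in this thread (chunk-000000.000000, {guid}.scavenge.tmp), and `summarize_data_dir` is a hypothetical helper, not an Event Store tool. It does not modify anything and does not try to judge whether the database is recoverable.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming, inferred from the file names quoted in this thread:
#   chunk-<6-digit chunk number>.<6-digit version>   e.g. chunk-000000.000001
#   <anything>.scavenge.tmp                          leftover scavenge temp file
CHUNK_RE = re.compile(r"^chunk-(\d{6})\.(\d{6})$")


def summarize_data_dir(data_dir):
    """Return (versions_by_chunk_number, temp_file_names) for a data directory.

    Read-only: files are listed, never opened or changed.
    """
    versions = defaultdict(list)
    temp_files = []
    for entry in sorted(Path(data_dir).iterdir()):
        if not entry.is_file():
            continue
        match = CHUNK_RE.match(entry.name)
        if match:
            # Group every on-disk version under its chunk number.
            versions[int(match.group(1))].append(int(match.group(2)))
        elif entry.name.endswith(".scavenge.tmp"):
            temp_files.append(entry.name)
    return dict(versions), temp_files
```

Running it against the data directory on each node and diffing the results would at least show where the interrupted scavenge left the failed nodes.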

And in fact we did run out of disk space at that time, and the lock might be related to antivirus activity. **Is there a bug in the handling of low disk space and/or locked files? And is there a way to recover from a ‘corrupt database detected’ error?**

For now we have just removed the data from the failed nodes and let them be repopulated from the healthy one, but I can imagine a situation where all 3 nodes fail because of the issues I mentioned.

This error:

Couldn’t acquire exclusive lock on DB at ‘E:\EventStore\data-EventStoreNode1’

usually happens at startup. The process holds a mutex to protect against two instances writing to the same data files; as you can imagine, multiple processes writing to the same files would cause problems.
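To illustrate the idea of a single-writer guard, here is a minimal sketch. Note that it uses an exclusively created lock file rather than a mutex, so it is not how Event Store does it; `DataDirLock` and the `db.lock` file name are made up for the example.

```python
import os


class DataDirLock:
    """Hypothetical single-writer guard based on an exclusively created lock file."""

    def __init__(self, data_dir, name="db.lock"):
        self.data_dir = data_dir
        self.path = os.path.join(data_dir, name)
        self.fd = None

    def acquire(self):
        try:
            # O_EXCL makes the call fail if the lock file already exists,
            # so only one process can create it.
            self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise RuntimeError(
                f"Couldn't acquire exclusive lock on DB at '{self.data_dir}'"
            ) from None
        # Record the owner's process id to help with manual diagnosis.
        os.write(self.fd, str(os.getpid()).encode())

    def release(self):
        if self.fd is not None:
            os.close(self.fd)
            os.remove(self.path)
            self.fd = None
```

A second instance pointed at the same directory fails in `acquire()` instead of writing to the same files. One difference from a mutex: a lock file survives a crash and has to be cleaned up by hand, whereas the operating system releases a mutex when the owning process dies.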

**‘there is not enough disk space on disk’**

is pretty self-explanatory.
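One way to avoid hitting this in the middle of a scavenge is to check free space before starting one. A minimal sketch, assuming you pick the required headroom yourself (`required_bytes` is a placeholder, not a number Event Store documents, and `check_before_scavenge` is a hypothetical helper):

```python
import shutil


def check_before_scavenge(data_dir, required_bytes):
    """Return (ok, free_bytes) for the volume that holds data_dir.

    required_bytes is an operator-chosen safety margin (an assumption here),
    not a value taken from Event Store.
    """
    usage = shutil.disk_usage(data_dir)  # named tuple: total, used, free
    return usage.free >= required_bytes, usage.free
```

A scheduled job could call this before triggering a scavenge and skip the run (and alert) when `ok` is false.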

I have some follow-up questions:

1. Is there a way to manually recover from this, e.g. by manipulating the scavenge output files?

2. Is this the result of the scavenge process being interfered with by another process (i.e. antivirus locking the files)?

3. Is this the result of running out of disk space?

4. Or is it a combination of 2. and 3. at the same time?

Agreed. The best repair steps for when this happens are all I need as well. It has happened to me twice, but since I am in early development, I just started over.


Regarding 1.: assuming more disk space was added after the crash, would there then be some manual steps for recovery?