Hi,
We have a 3-node Event Store setup (each node on a different machine), and the data directory on each machine is over 1 TB. We recently ran into a corrupt database error: 2 of our nodes stopped with the following exception and kept throwing it on every restart:
Unhandled exception while starting application:
EventStore.Core.Exceptions.CorruptDatabaseException: Corrupt database detected. ---> EventStore.Core.Exceptions.ChunkNotFoundException: E:\EventStore\data-EventStoreNode1\chunk-000000.000000 not found.
--- End of inner exception stack trace ---
We had a scavenge running on this cluster, and I can tell from the chunk file suffixes (.000001) that it completed on the healthy node but could not finish on the failed nodes (lots of {guid}.scavenge.tmp files left behind). Right before those exceptions started to occur, I can see other exceptions in the logs: "Couldn't acquire exclusive lock on DB at 'E:\EventStore\data-EventStoreNode1'." on the 1st node and "there is not enough disk space on disk" on the 2nd node.
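The check described above (chunk file suffixes, leftover scavenge temp files, remaining disk space) can be sketched roughly in Python as follows. This is only an illustration based on the file names visible on disk (chunk-000000.000000, {guid}.scavenge.tmp); the function name `summarize_data_dir` is hypothetical, it does not use any Event Store API, and reading the second number in a chunk file name as a version suffix is an assumption.

```python
import re
import shutil
from collections import Counter
from pathlib import Path

# Assumed naming scheme, taken from the file names seen on disk:
# chunk-<6 digits>.<6 digits>, e.g. chunk-000000.000000
CHUNK_RE = re.compile(r"^chunk-(\d{6})\.(\d{6})$")


def summarize_data_dir(data_dir):
    """Return (suffix_counts, leftover_tmp_files, free_bytes) for a data directory.

    suffix_counts maps each chunk suffix (e.g. "000001") to the number of
    chunk files carrying it; leftover_tmp_files lists *.scavenge.tmp names.
    """
    root = Path(data_dir)
    suffix_counts = Counter()
    leftovers = []
    for entry in root.iterdir():
        if not entry.is_file():
            continue
        match = CHUNK_RE.match(entry.name)
        if match:
            suffix_counts[match.group(2)] += 1
        elif entry.name.endswith(".scavenge.tmp"):
            leftovers.append(entry.name)
    # Free space on the volume that holds the data directory.
    free_bytes = shutil.disk_usage(root).free
    return suffix_counts, sorted(leftovers), free_bytes
```

Running this against each node's data directory (with the node stopped) would show at a glance which nodes still have only .000000 chunks, how many temp files were left behind, and how much free space remains before another scavenge is attempted.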
We did in fact run out of disk space at that time, and the lock might be related to antivirus activity. **Is there a bug in the handling of low disk space and/or locked files? And is there a way to recover from "Corrupt database detected"?**
For now we have simply removed the data from the failed nodes and let them be repopulated from the healthy one, but I can imagine a situation in which all 3 nodes fail because of the issues I mentioned.