Exception on Node Startup while opening Chunks

daniel.steiner · June 8, 2020, 9:32am

Recently we have been experiencing frequent Eventstore node crashes, which require a lot of manual monitoring and restarting. Usually a restart of the service solves the issue, but now we have a node that we can’t start without restoring a backup…
(We assume that most of our node crashes occur because we are running scavenges, which are very important for us because of disk usage…)

Our Setup:

Ubuntu 16.04 on GCP
5 Node Cluster
Eventstore 5.0.4
Over 9000 Chunks
Scavenge running on each node to keep disk space “low”

The exception we’re getting is “Chunk #8338 is not present in DB.” I tried to understand where and why the exception occurs in the source code. It seems to me that the node assumes the highest chunk is #8338 instead of #9161, and this crashes the whole startup process.
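
To cross-check this against what is actually on disk, a minimal sketch like the one below could summarize the chunk files in the db directory. It assumes the file naming seen in the log (chunk-NNNNNN.VVVVVV, six digits each for chunk number and version). The function name summarize_chunks is hypothetical and not part of Eventstore. Note that a gap in file names is not necessarily an error: as far as I understand, a scavenge can merge several chunks into one file, so the result is only a hint about where to look.

```python
import re
from collections import defaultdict

# Assumed naming scheme, inferred from the log: chunk-<number>.<version>
CHUNK_RE = re.compile(r"^chunk-(\d{6})\.(\d{6})$")


def summarize_chunks(filenames):
    """Summarize chunk file names: highest number, gaps, duplicate versions."""
    versions = defaultdict(list)
    for name in filenames:
        match = CHUNK_RE.match(name)
        if match:  # ignore index files, checkpoints, and anything else
            versions[int(match.group(1))].append(int(match.group(2)))

    if not versions:
        return {"highest": None, "gaps": [], "multi_version": {}}

    numbers = sorted(versions)
    present = set(numbers)
    # Chunk numbers with no file of their own between the lowest and highest.
    gaps = [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]
    # Chunk numbers that exist in more than one version on disk.
    multi = {n: sorted(v) for n, v in versions.items() if len(v) > 1}
    return {"highest": numbers[-1], "gaps": gaps, "multi_version": multi}
```

On the affected node this could be called as summarize_chunks(os.listdir("/mnt/disks/esdata/db")) after importing os, to see whether a file starting at #8338 exists and which chunk numbers have more than one version.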

Did anyone else encounter this problem before, and is there a solution to it other than having a backup?

A bit more log:
Jun 08 07:17:39 eventstored[10978]: Opened completed “/mnt/disks/esdata/db/chunk-008333.000001” as version 3
Jun 08 07:17:39 eventstored[10978]: Opened completed “/mnt/disks/esdata/db/chunk-008334.000001” as version 3
Jun 08 07:17:40 eventstored[10978]: Opened completed “/mnt/disks/esdata/db/chunk-008335.000001” as version 3
Jun 08 07:17:40 eventstored[10978]: Opened completed “/mnt/disks/esdata/db/chunk-008336.000001” as version 3
Jun 08 07:19:07 eventstored[10978]: Removing excess chunk version: “/mnt/disks/esdata/db/chunk-008337.000000”…
Jun 08 07:19:07 eventstored[10978]: Unhandled exception while starting application:
Jun 08 07:19:07 eventstored[10978]: EXCEPTION OCCURRED
Jun 08 07:19:07 eventstored[10978]: Chunk #8338 is not present in DB.
Jun 08 07:19:07 eventstored[10978]: Parameter name: chunkNum
Jun 08 07:19:07 eventstored[10978]: “Chunk #8338 is not present in DB.
Jun 08 07:19:07 eventstored[10978]: Parameter name: chunkNum”
Jun 08 07:19:07 eventstored[10978]: EXCEPTION OCCURRED
Jun 08 07:19:07 eventstored[10978]: Chunk #8338 is not present in DB.
Jun 08 07:19:07 eventstored[10978]: Parameter name: chunkNum
Jun 08 07:19:08 systemd[1]: eventstore.service: Service hold-off time over, scheduling restart.