How to resolve fatal errors of the type "Verification of chunk X failed, terminating server"

Hi there,

We’ve got something we haven’t seen before on one of our ES clusters.

Verification of chunk #710-710 (chunk-000710.000004) failed, terminating server…
EventStore.Core.Exceptions.HashValidationException: Exception of type ‘EventStore.Core.Exceptions.HashValidationException’ was thrown.
at EventStore.Core.TransactionLog.Chunks.TFChunk.TFChunk.VerifyFileHash () [0x0010d] in :0
at EventStore.Core.TransactionLog.Chunks.TFChunkDb+c__AnonStorey0.<>m__0 (System.Object _) [0x0001e] in :0

It looks like these fatal logs started a few days ago:

[attachment: ES fatal past 30 days.png]

I’m not sure why it’s only affecting client apps now (clients complain when the ES connection is dropped), but that’s beside the point: clients recover once the ES cluster is healthy again.

On the face of it the error message seems concerning, especially since it has caused two master elections in a few hours.

Interestingly, the two master elections today mean the original node is now back as master. Looking at what has changed on the cluster (which has otherwise been stable for the past 30 days), the appearance of the fatal logs correlates with what appears to be a change to the cluster configuration: it looks like an NPC was added on the 24th at around the same hour… The new master at that time was one promoted from one of the slaves, so I’m confused why chunk verification has only started failing now.

[attachment: ES nodes.png]

So, questions:

  1. Is it bad that we’re back on the original master? Presumably the problem is still there and I’ll get another blip in a few hours.
  2. What is the chunk verification process? Does it run in the background? Does it run on all nodes or only the master? Am I seeing another master election after a few hours because chunk verification took that long to reach the bad chunk?
  3. How do I fix this? Should I shoot the master instance, force a new instance to come up and hope the backup chunk files aren’t corrupted?
  4. My main concern is whether the backup chunk files could have the same data corruption. If the backup chunk files have the same error, should I expect to see a delay of some hours before the chunk verification background process reaches the bad chunk and kills the ES process?

I’ve contacted GetEventStore support but there may be others out there who know answers to my questions. :slight_smile:

cheers,

Justin

I’m sure the ES team will get back to you soon, but in the meantime…

Chunk verification is a background process that runs once on startup.

In terms of recovery: short term, verification can be disabled with the configuration option -SkipDbVerify. This will stop your node from terminating on the failed check, so you can keep running while you sort out the underlying problem.
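
For example (a sketch only: the exact option spelling varies by ES version and platform, and the config file path below is an assumption):

```
# Temporary workaround, not a fix: skip the startup hash check.
# Windows-style command line:
EventStore.ClusterNode.exe -SkipDbVerify

# Or the equivalent key in the YAML config file (path is an assumption):
#   /etc/eventstore/eventstore.conf
#     SkipDbVerify: True
```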

The proper fix would be to correct the corrupt chunk. Assuming you’re running a cluster, you can copy chunks of the same number between nodes. Obviously the node you are copying to should be shut down while you copy the file over. This is predicated on the chunk corruption being a disk error or some other failure on a single node.
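
Something along these lines, say (hostnames, the service name and the data directory are assumptions; adjust to your setup):

```
# On the node that failed verification:
sudo systemctl stop eventstore

# Copy the same chunk file from a healthy node (data dir assumed):
scp healthy-node:/var/lib/eventstore/chunk-000710.000004 \
    /var/lib/eventstore/chunk-000710.000004

# Verification runs again on startup and should now pass.
sudo systemctl start eventstore
```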

If you are getting this error on all nodes for the same chunk, you may have hit some unknown bug that I’m sure the ES team would be very interested in investigating. In that case a backup copy may or may not help.

Hope this helps in some way.

Cheers,

Laurence

:+1:

This is the normal case: it’s basically saying the chunk failed a hash validation. TFChunk files (well, all major files) in ES are immutable and hashed, and that hash check is failing. It is highly unlikely this is a bug; it’s far more likely a disk issue or similar (a single flipped bit fails the check). If you have the same chunk in a backup or on another node in the cluster, just copy it over the existing one and it should be fine.
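
To illustrate the “one bit off fails” point, here is a sanity check you can run yourself before restoring (this is not ES’s internal verification, and the paths are assumptions):

```
# Compare the suspect chunk against the copy in your backup.
md5sum /var/lib/eventstore/chunk-000710.000004      # on-disk (suspect) copy
md5sum /backups/eventstore/chunk-000710.000004      # backup copy (path assumed)
# Differing digests confirm the local file changed after it was written.
```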

Cheers,

Greg

Thanks both for the replies.

For the record, ES support came back pretty quickly with the same answer, and replacing the bad chunk with a copy from our S3 backup did fix the problem. :smiley:
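
For anyone searching later, the restore is essentially a stop/copy/start (bucket name, key layout, service name and data directory below are all assumptions):

```
sudo systemctl stop eventstore
aws s3 cp s3://my-es-backups/node-1/chunk-000710.000004 \
    /var/lib/eventstore/chunk-000710.000004
sudo systemctl start eventstore   # startup verification should now pass
```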

The only action left is to fix our monitoring so we don’t run for several days with a node that’s thrashing and therefore not actually participating in the cluster. There don’t appear to be any metrics on the stats endpoint that we could use to track a continually crashing/restarting ES process, but alerting on fatal ES server logs will also work…

cheers,

Justin

Try /gossip :wink:
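
For example, something like this could feed an alert (port, node name and JSON field names are assumptions; check what your ES version actually returns from the gossip endpoint):

```
# Poll cluster gossip and pull out each member's reported state.
curl -s http://node-1:2113/gossip | jq '.members[] | {state, isAlive}'
# Alert if any member reports isAlive == false or an unexpected state.
```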