We are experiencing very long hash verification times when running this in Windows Azure using an attached data disk:
[PID:1268 2013.04.22 11:06:55.499 Trace TFChunk 1]
Verifying hash for TFChunk ‘f:\eventdb\chunk-000000.000000’…
[PID:1268 2013.04.22 11:14:20.712 Trace TFChunk 1]
Verifying hash for TFChunk ‘f:\eventdb\chunk-000001.000000’…
[PID:1268 2013.04.22 11:22:11.615 Trace TFChunk 1]
Verifying hash for TFChunk ‘f:\eventdb\chunk-000002.000000’…
[PID:1268 2013.04.22 11:29:14.411 Trace TFChunk 1]
Verifying hash for TFChunk ‘f:\eventdb\chunk-000003.000000’…
[PID:1268 2013.04.22 11:36:18.104 Trace TFChunk 1]
Verifying hash for TFChunk ‘f:\eventdb\chunk-000004.000000’…
This causes the instance to stop responding to requests for quite a while (which brings everything crashing down). Any ideas on (a) how to speed this up, or (b) how to skip this step?
Not sure if skipping this step would be safe, but the verification is taking way too long right now.
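For reference, the time spent per chunk can be worked out from the trace timestamps above. This is only a rough sketch: it assumes each chunk's verification ends when the next "Verifying hash" line is logged, and it uses the 255 MB chunk size mentioned elsewhere in this thread. The names `starts`, `CHUNK_MB` and `verification_intervals` are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the trace lines above (one per "Verifying hash" entry).
starts = [
    "2013.04.22 11:06:55.499",
    "2013.04.22 11:14:20.712",
    "2013.04.22 11:22:11.615",
    "2013.04.22 11:29:14.411",
    "2013.04.22 11:36:18.104",
]

CHUNK_MB = 255  # assumed chunk size, taken from the figure quoted in the thread


def verification_intervals(stamps):
    """Return the seconds elapsed between consecutive log timestamps."""
    times = [datetime.strptime(s, "%Y.%m.%d %H:%M:%S.%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


intervals = verification_intervals(starts)
for i, secs in enumerate(intervals):
    # Assumption: the gap to the next log line is the time spent on chunk i.
    print(f"chunk-{i:06d}: {secs:.1f} s, about {CHUNK_MB / secs:.2f} MB/s")
```

Under those assumptions each chunk takes roughly seven to eight minutes, which works out to somewhere around 0.5 to 0.6 MB/s.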
Wow, minutes to compute a hash on a 255 MB file?! Are you sure this is a
local disk? What is the processor?
This step can be skipped; see --skip-db-verify:
https://github.com/EventStore/EventStore/wiki/Running-the-Event-Store#on-windows-and-net
Are these sequential messages in the log? These are all for the same chunk.
Those are each different files (e.g. chunk-00000x.000000, where x is different). They are 255 MB each and they are on an attached data disk, which is a pass-through disk to blob storage. I am sure the performance is related to this, but it is particularly bad here.
Machine size is S(mall) - so 1.7 GHz and ~2 GB RAM.
I have not tried with blob storage before, but it sounds quite slow
even if it's pulling over 255 MB at a time.
I might also guess that since the files are immutable, if you don't
pull them over during the checksumming (likely cached after) you will
probably see slowdowns somewhere else. 2+ minutes to download 255 MB
seems a bit on the high side though.
Cheers,
Greg
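One way to check whether the disk or the CPU is the bottleneck is to time a plain sequential read-and-hash of a chunk file, once from the attached data disk and once from a copy on a local disk. The sketch below is not Event Store's own verification code. MD5 is used purely as a stand-in (the thread does not say which hash is involved), and `time_hash` and `report` are hypothetical helper names.

```python
import hashlib
import time


def time_hash(path, block_size=1024 * 1024):
    """Read the file at `path` in blocks, hash it with MD5, and time the pass.

    Returns (hex_digest, bytes_read, elapsed_seconds).
    MD5 is an assumption here, chosen only to have some hash to compute.
    """
    digest = hashlib.md5()
    total = 0
    started = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digest.update(block)
            total += len(block)
    elapsed = time.perf_counter() - started
    return digest.hexdigest(), total, elapsed


def report(path):
    """Print size, elapsed time and throughput for one read-and-hash pass."""
    _, total, elapsed = time_hash(path)
    mb = total / (1024 * 1024)
    rate = mb / elapsed if elapsed > 0 else float("inf")
    print(f"{path}: {mb:.0f} MB in {elapsed:.1f} s ({rate:.2f} MB/s)")
```

Calling `report(r"f:\eventdb\chunk-000000.000000")` and then the same on a local copy should show whether the time goes into reading from blob storage or into hashing. A second run on the same file may be served from the OS cache, so only the first pass reflects the disk.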
Try creating a striped volume (4 disks or more); that may give you additional performance.
Also make sure your storage account is Gen 2.
It’s a newer account with the higher thresholds. I don’t want to stripe disks at this point because it just means more complexity and more disks that could get corrupted.
From what I can tell, it looks like the event store db was corrupted and we ended up losing all our data. I still have our entire folder of files that is dead now if anyone wants to investigate.
How did this go from long verification times to the event store db
being corrupted? Did I miss an email somewhere?
Even on a corrupt db we can extract probably 99% of the information
from the db using internal utilities we have.
I am a bit confused what happened in this thread, can you clarify?
Greg
You didn’t miss anything. You answered the question. My response was more for Yevhen, regarding why we are not going to stripe disks in Azure. The perf is more than adequate when not doing a hash verification.
The original intent here was to figure out how we could avoid these super long verification times. Every time our VM was rebooted, our service was down for a long time while this hash verification occurred. You answered that we could skip it - so, I did. That showed me that something else was wrong - the eventstore server no longer responded, everything just timed out. We ended up having to create a new db and rebuild the entire eventstore again from our views (held elsewhere luckily). That was the only way that the eventstore became responsive again. I am just assuming at this point it is because the db was corrupted. That might be a bad assumption.
Nothing that interesting that I can find. For that particular day, the only thing I found that was not in other logs was:
ReadIndex Rebuilding: processed 100000 records.
Perhaps it was stuck rebuilding the index when I gave up and just created a new eventstore db.
It does that every time you restart.