Restart failed — Record too large

Hello,

Why does this crash? EventStore runs as a docker container on Kubernetes as a stateful set, i.e. only one instance can run at a time. It’s been working well for months, and now that we upgraded the cluster we got this error when restarting the node:

// snip
[00001,11,11:33:14.231] Verifying hash for TFChunk ‘/var/lib/eventstore/chunk-000000.000000’…
[00001,25,11:33:18.791] ReadIndex Rebuilding: processed 317801 records (36.9%).
[00001,25,11:33:23.821] ReadIndex Rebuilding: processed 317921 records (37.1%).
[00001,25,11:33:28.824] ReadIndex Rebuilding: processed 320211 records (40.7%).
[00001,25,11:33:33.824] ReadIndex Rebuilding: processed 323394 records (45.6%).
[00001,25,11:33:38.825] ReadIndex Rebuilding: processed 326922 records (50.4%).
[00001,25,11:33:43.826] ReadIndex Rebuilding: processed 330190 records (55.3%).
[00001,25,11:33:48.829] ReadIndex Rebuilding: processed 333681 records (60.3%).
Exiting with exit code: 1.
Exit reason: Error in StorageChaser. Terminating…
Error: Record too large.
[00001,25,11:33:52.046] Error in StorageChaser. Terminating…
Record too large.
[00001,25,11:33:52.085] Exiting with exit code: 1.
Exit reason: Error in StorageChaser. Terminating…
Error: Record too large.

What sort of error condition is this?

Regards

This looks like a corrupt db

Right, but.

  • Why is it corrupt?
  • Has it got something to do with your note not to run ES in the docker container in production? What are the failure conditions you’ve seen that caused you to write “not for production” on the docker container?
  • This database has been up for 351 days without any issues; why does it happen now?
  • What does “Error: Record too large.” even mean? Is it an event that’s too large? In that case, which event in which stream?
  • We have taken a quick-and-dirty approach to storing some things in EventStore, which means we’ve written some rather large values into streams. Could it be that one of these values is too large and is exposing an issue/bug in ES?

We’re using ES 3.9.3, which is the latest v3 release.

Further;

  • We may have restarted the node while it was reindexing/verifying — could that corrupt the DB? Have you load-tested restarting/terminating ES while it’s reindexing?

Most likely IMO it is corruption due to docker not really supporting durability in all cases. What type of disk is mapped to the instance and which file system?

The problem is it’s turtles all the way down when figuring out whether an fsync actually works. When you have five levels of virtualization, determining which one is the issue is quite tedious.

Hi,

See the attached for the config.

The disk is this on Google Cloud/Compute:

  • Type: SSD persistent disk
  • Size: 40 GB
  • Zone: europe-west1-c
  • Estimated performance: sustained random IOPS limit 1,200.00 read / 1,200.00 write; sustained throughput limit 19.20 MB/s read / 19.20 MB/s write
  • Encryption: Automatic

eventstore-censored.yaml (2.65 KB)

So given this, how can we be sure this setup has a problem with fsync?

Why does the DB get corrupted by not having fsync go through? In my world-view this would just cause loss of data that’s been ACKed, not outright corruption of the data.

“Why does the DB get corrupted by not having fsync go through?”

Because we call a system call that says “put this on disk” and the OS says “ok I did that” when it actually has not.
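Roughly, the pattern in question looks like the sketch below (an illustrative Python sketch of a length-prefixed append-only log, not EventStore’s actual code or on-disk format). The writer only acknowledges after fsync returns; if any layer below reports the fsync as done without actually persisting the bytes, a crash can leave a torn record at the tail, and on the next start the reader decodes a garbage length prefix, which is one plausible way to end up with an error like “Record too large”.

    import os
    import struct

    # Illustrative only: a length-prefixed append-only log, not EventStore's real format.
    def append_record(path: str, payload: bytes) -> None:
        with open(path, "ab") as f:
            f.write(struct.pack("<I", len(payload)))  # 4-byte length prefix
            f.write(payload)
            f.flush()                 # push userland buffers to the OS
            os.fsync(f.fileno())      # ask the OS to make it durable before we ACK
        # Only after fsync returns is the write acknowledged to the client.
        # If any layer below (filesystem, Docker volume, hypervisor, disk cache)
        # acknowledges the fsync without persisting, a crash can leave a torn
        # record: on restart the reader sees a garbage length prefix.

    def read_records(path: str):
        with open(path, "rb") as f:
            while True:
                prefix = f.read(4)
                if len(prefix) < 4:
                    break
                (length,) = struct.unpack("<I", prefix)
                if length > 16 * 1024 * 1024:             # hypothetical sanity limit
                    raise ValueError("Record too large")  # corrupt tail decodes as a huge record
                yield f.read(length)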

All I’m saying is that while that may be true, it should not corrupt the whole DB; only lose data.

And I’m also asking how I fix it and which streams are affected; basically I’m asking for operational tooling, now that this has indeed happened. Since this DB in particular had a very low write load (next to nothing), I’m fairly sure it’s a problem with one of the statistics streams and not a data stream. That’s why I’m asking the above barrage of questions (yet to be fully clarified):

  • Has it got something to do with your note not to run ES in the docker
    container in production? What are the failure conditions you’ve seen that caused you to write “not for production” on the docker container?
  • This database has been up for 351 days without any issues; why does it happen now?
  • What does “Error: Record too large.” even mean? Is it an event that’s too large? In that case, which event in which stream?

PS: we haven’t concluded it’s actually an fsync error.

We have been running tests for years showing that our data is actually durable on varying hardware, including power-outage tests and the like. I would be amazed if it was a problem: https://eventstore.org/blog/20130708/testing-event-store-ha/

Yes, alright, but since we’re running in the cloud and not on our own hardware, that is not enough for us. We’re not going to purchase hardware any time soon.

Beyond the obvious questions above (how to fix it, how to diagnose which streams are affected, etc.), it would also be relevant to know how to make sure fsync actually reaches durable storage on Google Cloud. I’ve directed Google to this thread; perhaps they can shed some light on it?

We use an OS-level call to ensure writes. The problems with that call are well documented, as some underlying systems lie. There is little we can do about a system under us that lies.

If you want, you can test this on hardware as well as in cloud environments. We have found cases where local hardware lies and says it did an operation when it really only cached it. Old documentation, but: https://github.com/EventStore/EventStore/wiki/Reliability

Ok, but leaving aside the correctness (or not) of EventStore, how do I:

  1. See what streams are affected?
  2. Breathe life into the DB?
  3. Ensure this never happens in production?

I hope there can be some sort of solution to this, come Monday.

I would also love to hear what you can do when a problem like this occurs.

Do you have to create backups to protect yourself from losing data in a scenario like this? Is there no way to roll back to the last known-good state?

We are evaluating Event Store as well, but if there is no real way to recover from this problem it might be a deal breaker for us.

We’ve had this issue in production; there was significant data loss: everything after the faulty event.

Same thing with a corrupt database due to the Windows write cache in an environment we didn’t control.

These things happen. I think it would be very beneficial to have some kind of disaster-recovery mode in Event Store, where everything but the corrupt stream could be salvaged…

/Peter

It’s pretty easy to handle actually, and we have done it in the past, but any database will run into such issues when running in an environment where disks are not actually durable. This is why we do power-pull testing etc. (to make sure we are durable). It does not matter whether it’s a disk with a cache that lies, an OS that lies, or a docker container with non-durable disks: these issues apply to literally any database.

As an example, during testing (https://eventstore.org/blog/20130708/testing-event-store-ha/) we found Intel 320/710 drives to be durable but Intel 520 drives not to be. There are tools out there to test disk durability (basically a program you run on the machine which handles writes, and a client program on another machine that keeps track of things); a sketch of the idea follows below. For SSDs a good hint is whether the disk has a super-capacitor on it (kind of like a battery backup for spindle drives).
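To make that concrete, here is a minimal sketch of that kind of tester (hypothetical code, not an existing tool; the port, file name, and record format are made up). The writer runs on the machine under test and acknowledges each record only after fsync; the checker runs on another machine and remembers the highest acknowledged sequence number. Pull the power on the test machine mid-run, restart it, and verify that every acknowledged record is actually in the file; a missing one means some layer lied about the fsync.

    # Hypothetical durability tester sketch: writer on the machine under test,
    # checker on another machine. Port, file name and record format are made up.
    import os
    import socket
    import struct
    import sys

    PORT = 9999
    LOG = "durability.log"

    def writer() -> None:
        """Append an 8-byte sequence number, fsync, then ACK it over the network."""
        srv = socket.socket()
        srv.bind(("0.0.0.0", PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        seq = 0
        with open(LOG, "ab") as f:
            while True:
                f.write(struct.pack("<Q", seq))
                f.flush()
                os.fsync(f.fileno())                  # claim durability only after fsync
                conn.sendall(struct.pack("<Q", seq))  # ACK to the remote checker
                seq += 1

    def checker(host: str) -> None:
        """Track the highest sequence number the writer acknowledged."""
        conn = socket.create_connection((host, PORT))
        highest = -1
        try:
            while True:
                data = conn.recv(8, socket.MSG_WAITALL)
                if len(data) < 8:
                    break  # writer died (e.g. the power pull)
            	(highest,) = struct.unpack("<Q", data)
        finally:
            # After restart, every seq <= highest must be present in LOG on the
            # test machine; any missing one means a layer lied about the fsync.
            print("highest acknowledged sequence:", highest)

    if __name__ == "__main__":
        writer() if sys.argv[1] == "writer" else checker(sys.argv[2])

Run it as "writer" on the machine under test and as "checker <writer-ip>" elsewhere; after the crash, count the 8-byte records in durability.log and compare with the checker’s output.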

Of course, in a cluster the answer is as simple as just re-replicating the data.
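On a single node, the usual protection (and the answer to the backup question above) is a regular file-level copy of the data directory, so you can roll back to a last-known-good state. Below is a rough Python sketch of that idea; the ordering follows my recollection of the documented copy procedure (checkpoint files first, with chaser.chk saved as truncate.chk so a restore truncates to a position the backup fully contains). Treat it as an illustration and verify the exact steps against the official backup documentation for your version.

    import glob
    import os
    import shutil

    def backup_eventstore(db_dir: str, backup_dir: str) -> None:
        """Rough file-level backup sketch; verify ordering against the official docs."""
        os.makedirs(backup_dir, exist_ok=True)

        # 1. Copy the chaser checkpoint first and save it as truncate.chk, so a
        #    restore truncates back to a position fully covered by this backup.
        shutil.copy2(os.path.join(db_dir, "chaser.chk"),
                     os.path.join(backup_dir, "truncate.chk"))

        # 2. Copy the remaining checkpoint files.
        for chk in glob.glob(os.path.join(db_dir, "*.chk")):
            shutil.copy2(chk, backup_dir)

        # 3. Copy the index directory and then the chunk files.
        index_dir = os.path.join(db_dir, "index")
        if os.path.isdir(index_dir):
            shutil.copytree(index_dir, os.path.join(backup_dir, "index"),
                            dirs_exist_ok=True)
        for chunk in glob.glob(os.path.join(db_dir, "chunk-*")):
            shutil.copy2(chunk, backup_dir)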

As a laugh: during the BuildStuff conference a couple of years ago (not that I would be busy in the middle of running a conference), Azure storage had a major outage and lost writes. Poor James Nugent and I were fixing databases manually most of the day :slight_smile:

Sure, I’ve had data loss on SQL Server, but never to this extent. And I have been able to fix it myself with the provided tools.

(The server crash I mentioned above had SQL Server running on it as well, but only ES was affected. In fact, I think it’s been 10-15 years since I had to repair a corrupt SQL Server database. Could be that MS actually bypasses their own write cache. I know I would have…)

We’ve had great help from support on the rare occasions we had problems, so it’s not a major issue. But I still think we could benefit from more advanced tooling, perhaps for manually migrating missing streams once an initial disaster recovery has you up and running again.

/Peter