Rebuilding indexes very slow, production client

jginzo · September 14, 2021, 1:23pm

We are experiencing a critical issue with a client that has a fairly large database. At some point, EventStore crashed. I don't know the specific error that caused the crash.

However, when they try to restart EventStore, it gets stuck rebuilding indexes.

We are using version 4.1.4.0.

To be clear, they left the database rebuilding indexes for weeks without it finishing.

I do have a log message:

[PID:04904:017 2021.03.13 00:00:51.416 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 2734447060 records (48.2%).
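A line like this is enough for a rough size estimate: 2734447060 records at 48.2% implies about 5.67 billion records in total. Below is a minimal Python sketch that parses such lines and extrapolates the remaining time from two of them. The regex is inferred from this single log line only, the function names are made up for illustration, and the ETA assumes progress is linear, which may well not hold once memory is exhausted.

```python
import re
from datetime import datetime

# The one progress line quoted in this thread.
LINE = ("[PID:04904:017 2021.03.13 00:00:51.416 DEBUG IndexCommitter ] "
        "ReadIndex Rebuilding: processed 2734447060 records (48.2%).")

# Pattern inferred from that single line; other versions may log differently.
PATTERN = re.compile(
    r"\[PID:\d+:\d+ (?P<ts>\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*?\] "
    r"ReadIndex Rebuilding: processed (?P<count>\d+) records \((?P<pct>[\d.]+)%\)"
)


def parse_progress(line):
    """Return (timestamp, records_processed, percent) or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y.%m.%d %H:%M:%S.%f")
    return ts, int(m.group("count")), float(m.group("pct"))


def estimate_total(count, pct):
    """Estimate the total number of records from a count and its percentage."""
    return round(count / (pct / 100.0))


def eta_seconds(first, last):
    """Linearly extrapolate the seconds remaining from two parsed samples."""
    (t0, _, p0), (t1, _, p1) = first, last
    elapsed = (t1 - t0).total_seconds()
    if elapsed <= 0 or p1 <= p0:
        return None  # no measurable progress between the samples
    rate = (p1 - p0) / elapsed  # percent per second
    return (100.0 - p1) / rate


if __name__ == "__main__":
    ts, count, pct = parse_progress(LINE)
    print("estimated total records:", estimate_total(count, pct))
```

Feeding `eta_seconds` the first and last progress lines from a log file shows whether the rebuild is still moving at a useful rate or has effectively stalled.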

The machine has 32 GB of RAM, but memory usage is pegged at 99% while this is going on.

Tomaz_Strukelj · September 22, 2021, 5:38pm

The server needs more memory for the indexing process than for normal operation. Use a machine with more memory to build the index.
If you don’t have one at your disposal, use a cloud server instance with 64 GB of memory or more, depending on how large your db folder is. Archive the db folder into a .tar.gz, upload it to the temporary instance, install the same 4.1.4.0 version there and start it with the extracted db folder.
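The archiving step could be sketched in Python as below, using the standard `tarfile` module. This is only an illustration: the function name and paths are hypothetical, and it assumes the node is stopped so the files do not change while they are being read. For a db folder of several hundred GB, plain `tar` on the command line will do the same job.

```python
import tarfile
from pathlib import Path


def archive_db_folder(db_dir, archive_path):
    """Pack db_dir (and everything under it) into a gzip-compressed tarball.

    Assumes the EventStore node is stopped while this runs.
    """
    db_dir = Path(db_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps only the folder name, so the archive extracts
        # to ./<folder name>/ instead of the full original path.
        tar.add(str(db_dir), arcname=db_dir.name)
    return Path(archive_path)


if __name__ == "__main__":
    # Hypothetical paths; adjust to the actual installation.
    archive_db_folder("/var/lib/eventstore/db", "/tmp/eventstore-db.tar.gz")
```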

We’ve been in a somewhat similar situation with the same 4.1.4.0 version. The cluster started consuming a large amount of memory due to a bug that allowed a large event to be inserted. When the master crashed with an out-of-memory error, the election caused another memory spike (a few GB within one second) for some reason, which we had never observed up to that point, and that kept happening until we increased the instance size. At first we thought the problem was with the index, so we started the reindexing process on one node while we restored the cluster with the other 2 nodes.
With a 32 GB instance, a 350 GB db folder and a 20 GB index, it ran for 3 days and got to around 80%, at which point it had consumed the whole 32 GB. As we had already sorted out the cluster with the old index, this wasn’t an issue for us. We would need a 64 GB server to recreate the index; we haven’t done that yet, as there seems to be no need for it.
I’ve observed that the indexing process creates smaller index files and then merges them into bigger ones, so you can tell whether the indexing process is still working or has got stuck: just check whether there is any activity in the index folder.
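One way to check for activity in the index folder is to look at the most recent modification time of any file in it. The Python sketch below does that; the function name and the example path are hypothetical, and a quiet folder does not prove the rebuild is stuck, since a long merge may not touch any file for a while.

```python
import os
import time


def latest_index_activity(index_dir):
    """Return (path, age_in_seconds) of the newest file under index_dir.

    Returns None if the folder contains no files.
    """
    newest = None
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file was removed mid-scan, e.g. after a merge
            if newest is None or mtime > newest[1]:
                newest = (path, mtime)
    if newest is None:
        return None
    return newest[0], time.time() - newest[1]


if __name__ == "__main__":
    # Hypothetical location of the index folder; adjust as needed.
    result = latest_index_activity("/var/lib/eventstore/db/index")
    if result is None:
        print("no index files found")
    else:
        path, age = result
        print(f"newest file: {path}, modified {age / 60:.1f} minutes ago")
```

Running this every few minutes and watching whether the age keeps resetting gives a quick signal of whether new index files are still being written.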

Have you lost the old index? Maybe you have an old index in a backup; you could restore it, and EventStore would reuse as much of it as it can and then continue indexing from that point onward. Do you have a single node or a cluster of 3?