Restart of Eventstore taking very long on AKS

Hi,

we’re running Eventstore 4.1.3 on Azure Kubernetes Service (AKS), and with the growing amount of data, restarting Eventstore takes longer and longer: currently somewhere between 30 minutes and 2 hours per node for roughly 120GB of data. With an Eventstore cluster size of 5, rolling out configuration changes or updating nodes (Eventstore or the underlying Kubernetes nodes) currently takes around 6 hours or more.

While this seems largely related to the performance characteristics of the underlying “Premium Disks” (or to Eventstore’s disk access patterns not being optimal for Azure, as Azure support sees it), I wonder how to work with this and what options I have to improve the situation in the given environment.

Now I have taken a look at the various configuration options Eventstore offers that might (or might not) affect startup speed specifically; runtime performance is the second thing to analyze later.

On a development server, the following options reduced startup time dramatically (back down to just a few minutes from start to ready).

EVENTSTORE_OPTIMIZE_INDEX_MERGE: "True"
EVENTSTORE_SKIP_INDEX_VERIFY: "True"
EVENTSTORE_SKIP_DB_VERIFY: "True"
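
For anyone wanting to reproduce this, here is a minimal sketch of how the three options above can be set on a Kubernetes cluster; the namespace and StatefulSet name are placeholders for whatever your deployment uses, and changing the environment triggers a rolling restart of the pods.

# Sketch: set the three options on a StatefulSet named "eventstore" in namespace "es"
# (both names are placeholders); the pods restart to pick up the new environment.
kubectl -n es set env statefulset/eventstore \
  EVENTSTORE_OPTIMIZE_INDEX_MERGE=True \
  EVENTSTORE_SKIP_INDEX_VERIFY=True \
  EVENTSTORE_SKIP_DB_VERIFY=True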

I have yet to find out which of these options actually gives the observed boost, or whether it is the combination.

What do these options mean exactly, and **is it actually safe to enable them in production?**

I have read the docs, which say the following, but could anyone give me more details on that? Are the operations behind this expensive (I/O- or CPU-wise)?

Does disabling these checks actually skip a step forever, or does it just shift the work to later, when the data is actually accessed?

OptimizeIndexMerge: Bypasses the checking of file hashes of indexes during startup and after index merges. (Default: False)
SkipIndexVerify: Skips reading and verification of PTables during start-up. (Default: False)
SkipDbVerify: Bypasses the checking of file hashes of database during startup (allows for faster startup). (Default: False)

A second question: is there a way to increase the log level of Eventstore (especially for disk I/O related things), so that I could get a better understanding of what takes so long?

Best regards,

Markus

Markus,

Have you made any progress on this?

We can’t get our Event Store to start up now, and we are unsure whether it is related to your problem.

Our production system is currently down; our Event Store appears to start and indicates it is rebuilding indexes, but that’s it.

Can you see if there is disk activity, e.g. with iostat in the container? If it is rebuilding indexes, you should see read activity on the chunks and write activity on the index files.
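
For example, something along these lines (a rough sketch; the pod name and paths are placeholders, and it assumes iostat from the sysstat package is available in the image):

# Extended per-device I/O statistics, refreshed every 5 seconds, from inside the pod.
kubectl exec -it eventstore-0 -- iostat -x 5
# New PTable files showing up under the index directory are another sign a rebuild is progressing.
kubectl exec -it eventstore-0 -- ls -lh /var/lib/eventstore/index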

We will take a look.
Previously, I was sure the Event Store logged that it was rebuilding indexes (“1 of …”, etc.), but now there is nothing.

Hi,

the current state here is that we enabled all three options from above on our dev environment and did not see any change (other than the restart being faster).

I am still hoping that someone knowledgeable can answer whether that is safe in production.

We also had “stuck” nodes after boot once in a while (not getting out of “Initializing” or “CatchingUp” even after multiple hours), but I am not sure if this is related.

It might be a race condition that is more likely to occur when I/O is slow, or it might be completely unrelated. It always hit only one of the nodes at a time, so the cluster remained operable the whole time.

Restarting these a second time helped in these cases.
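
In case it helps with debugging the stuck nodes: the state each node reports can be read from the gossip endpoint. A rough sketch, assuming the default external HTTP port 2113, a pod named eventstore-0 and curl being available in the image:

# Each member entry in the gossip response has a "state" field
# (e.g. Initializing, CatchingUp, Clone, Slave, Master).
kubectl exec -it eventstore-0 -- curl -s http://localhost:2113/gossip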

Best regards,

Markus

Meanwhile I tried these options in isolation.

Setup is Eventstore 4.1.3 with 5 Nodes on AKS, data as well as indexes on premium disks.

On the development environment, we currently have around 80GB in 339 data chunks (numbered from 0000 to 0468 due to scavenges), and 4 index files like these:

-r--r--r-- 1 eventstore eventstore  94M Oct 10 09:30 830f4a14-5615-42c5-b0d9-6da160d089fc
-r--r--r-- 1 eventstore eventstore 735M Oct  9 19:25 c5746bd0-113f-4ea8-904f-deb830bad68f
-r--r--r-- 1 eventstore eventstore 1.5G Oct  9 14:21 943bdcea-ec4d-43a6-b882-18e56947d4a6
-r--r--r-- 1 eventstore eventstore 2.8G Oct  9 08:38 ae33e76f-56e3-403a-aa20-245f1dfd6c8e
-rw-r--r-- 1 eventstore eventstore  225 Oct 10 09:30 indexmap

Does it make sense that these are so different in size?

SkipIndexVerify=false, others set to true

Verifying the indexes takes a considerable amount of time, and until they are verified, the node stays in “Initializing”. Log excerpts for this:

super-store-1 super-store [00006,20,17:37:52.821] Loading PTable (Version: 4) '830f4a14-5615-42c5-b0d9-6da160d089fc' (4000280 entries, cache depth 16) done in 00:00:48.9836960.
super-store-1 super-store [00006,20,17:37:52.823] Loading and Verification of PTable 'c5746bd0-113f-4ea8-904f-deb830bad68f' started...
super-store-1 super-store [00006,20,17:44:19.161] Loading PTable (Version: 4) 'c5746bd0-113f-4ea8-904f-deb830bad68f' (32007762 entries, cache depth 16) done in 00:06:26.3385306.
super-store-1 super-store [00006,20,17:44:19.161] Loading and Verification of PTable '943bdcea-ec4d-43a6-b882-18e56947d4a6' started...
super-store-1 super-store [00006,20,17:57:21.810] Loading PTable (Version: 4) '943bdcea-ec4d-43a6-b882-18e56947d4a6' (64006586 entries, cache depth 18) done in 00:13:02.6475754.
super-store-1 super-store [00006,20,17:57:21.810] Loading and Verification of PTable 'ae33e76f-56e3-403a-aa20-245f1dfd6c8e' started...
super-store-1 super-store [00006,20,18:22:35.957] Loading PTable (Version: 4) 'ae33e76f-56e3-403a-aa20-245f1dfd6c8e' (124342337 entries, cache depth 19) done in 00:25:14.1465772.
super-store-1 super-store [00006,20,18:22:41.951] ReadIndex rebuilding done: total processed 673438 records, time elapsed: 00:00:05.9862890.

That sums up to roughly 45 minutes in total for PTable loading and verification. During this operation, CPU utilization is around 10-15% on a single core only, so it seems to be I/O bound rather than CPU bound.
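
(The 45 minutes is simply the sum of the four “done in” durations above; as a rough sketch, with the log file name being a placeholder, they can be totalled like this:)

# Sum the "done in HH:MM:SS" durations that the PTable loader logs.
grep -o 'done in [0-9:.]*' eventstore.log \
  | awk -F'[ :]' '{ secs += $3*3600 + $4*60 + $5 } END { printf "total: %.1f min\n", secs/60 }'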

Time from first log message to “IS SLAVE…”: 49 min

SkipDbVerify=false, others set to true

While DB verification also takes a considerable amount of time, it seems to happen “in the background”, not blocking the server start.

Log excerpt for first and last verification:

super-store-1 super-store [00007,06,18:35:15.738] Verifying hash for TFChunk '/var/lib/eventstore/chunk-000468.000000'...
super-store-1 super-store [00007,06,18:57:08.239] Verifying hash for TFChunk '/var/lib/eventstore/chunk-000000.000003'...

That totals 22 minutes for this step alone. During this, CPU usage is around 50% on one core, so it is also not completely CPU bound.

When both Skip*Verify options are OFF, both verification processes seem to “block” each other somehow, making them finish more or less at the same time.

Time from first log message to “IS SLAVE…”: 5 minutes

OptimizeIndexMerge=false, others set to true

This option alone does not seem to make a big difference – at least not at startup.

Time from first log message to “IS SLAVE…”: 6 minutes

All options set to true

Time from first log message to “IS SLAVE…”: 5 minutes

Do these numbers make sense?

And since Index verification is the largest slowdown: is it safe to disable?

Best regards,

Markus