ES TFChunk verification at startup

Hi,

My team just migrated from ES 2.x.x.x to 3.0.1.0, and I noticed that the TFChunk verification at startup takes about a minute in 3.0.1.0 but completes in a few seconds in 2.x.x.x. Has something changed internally? I know it's possible to disable the check at startup, but does ES still verify the TFChunks while it's running?

Best regards

Kasper Nørtoft

How are you measuring? A big difference is whether the chunk is already in the file cache.

You are seeing 1 minute per chunk? I'm guessing this is over a network?

I just noticed the performance difference in the logs. The chunks are located on the same machine as ES, so it isn't a network problem.

ES is also rebuilding the ReadIndex on every restart. Is it possible to prevent that?

What’s in your logs? It will rebuild only up to 1m records by default on restart; this is normally a fast operation.

In terms of differences in 3.0: we tell the OS not to cache the files when we load them on Windows (Windows has file caching problems where it doesn’t release memory), so the chunks get read off the physical disk. That said, it should never take a minute to verify a single chunk. Is it one minute for all chunks or one minute per chunk? How many chunks do you have?
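
Concretely, that means opening the chunk files with the FILE_FLAG_NO_BUFFERING flag, so reads bypass the Windows file cache and come straight off the physical disk. A minimal C# sketch of that technique (an illustration only, not the actual EventStore code) looks like this:

using System.IO;

static class UnbufferedOpen
{
    // FILE_FLAG_NO_BUFFERING is not exposed by the FileOptions enum,
    // so it is usually passed as its raw Win32 value.
    const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

    public static FileStream OpenForVerify(string path)
    {
        // With this flag Windows bypasses its file cache, so every read
        // hits the physical disk; reads then have to be sector-aligned,
        // which is why real code adds its own buffering on top.
        return new FileStream(path, FileMode.Open, FileAccess.Read,
                              FileShare.ReadWrite, 4096,
                              FileFlagNoBuffering | FileOptions.SequentialScan);
    }
}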

Greg

The logs are from our test server. Is it safe to disable the startup verification of the TFChunks, and do they still get verified while ES is running? The ReadIndex rebuilding looks pretty weird to me. Shouldn’t ES save the index somewhere so it can avoid rebuilding it on every restart?

[PID:02464:020 2014.10.31 11:02:19.804 DEBUG IndexCommitter ] ReadIndex rebuilding done: total processed 806218 records, time elapsed: 00:02:41.8510375.

[PID:02464:008 2014.10.31 11:03:02.158 TRACE TFChunk ] Verifying hash for TFChunk ‘C:\Program Files\EventStore\Db\chunk-000010.000000’…

[PID:02464:008 2014.10.31 11:03:32.048 TRACE TFChunk ] Verifying hash for TFChunk ‘C:\Program Files\EventStore\Db\chunk-000009.000000’…

[PID:02464:008 2014.10.31 11:04:05.027 TRACE TFChunk ] Verifying hash for TFChunk ‘C:\Program Files\EventStore\Db\chunk-000008.000000’…

[PID:02464:008 2014.10.31 11:04:42.872 TRACE TFChunk ] Verifying hash for TFChunk ‘C:\Program Files\EventStore\Db\chunk-000007.000000’…

[PID:02464:008 2014.10.31 11:05:20.188 TRACE TFChunk ] Verifying hash for TFChunk ‘C:\Program Files\EventStore\Db\chunk-000006.000000’…

[PID:02464:008 2014.10.31 11:06:06.146 TRACE TFChunk ] Verifying hash for TFChunk ‘C:\Program Files\EventStore\Db\chunk-000005.000000’…

[PID:02464:008 2014.10.31 11:06:44.350 TRACE TFChunk ] Verifying hash for TFChunk ‘C:\Program Files\EventStore\Db\chunk-000004.000000’…

[PID:02464:008 2014.10.31 11:07:20.324 TRACE TFChunk ] Verifying hash for TFChunk ‘C:\Program Files\EventStore\Db\chunk-000003.000000’…

[PID:02464:008 2014.10.31 11:07:54.598 TRACE TFChunk ] Verifying hash for TFChunk ‘C:\Program Files\EventStore\Db\chunk-000002.000000’…

[PID:02464:008 2014.10.31 11:08:27.686 TRACE TFChunk ] Verifying hash for TFChunk ‘C:\Program Files\EventStore\Db\chunk-000001.000000’…

[PID:02464:008 2014.10.31 11:09:05.906 TRACE TFChunk ] Verifying hash for TFChunk ‘C:\Program Files\EventStore\Db\chunk-000000.000000’…

Best regards

Kasper Nørtoft

The read index builds up to 1m items by default. The way the indexes work, there is always some amount transient in memory and some amount persistent on disk (they are batched); the sizes can be controlled via the command line as well, btw. E.g. on a brand new database there are no index files. Once it hits 1m items a persistent index is written, and if you then add 5 events the process on restart is: 1) load the persistent indexes, 2) replay the 5 events into the current in-memory index. This process for 1m events takes about 3 seconds on my laptop.
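
As a rough sketch of that restart process (the type and method names here are made up for illustration and are not EventStore's actual classes), the persisted index tables are loaded as-is and only the records written since the last persisted table are replayed into memory:

using System.Collections.Generic;

class ReadIndexRebuildSketch
{
    // Hypothetical entry: which stream/event a log position belongs to.
    public struct IndexEntry
    {
        public string Stream;
        public long EventNumber;
        public long LogPosition;
    }

    readonly List<IReadOnlyList<IndexEntry>> _persistedTables = new List<IReadOnlyList<IndexEntry>>();
    readonly List<IndexEntry> _memTable = new List<IndexEntry>();

    public void Rebuild(IEnumerable<string> persistedIndexFiles,
                        IEnumerable<IndexEntry> recordsAfterCheckpoint)
    {
        // 1) Load the already-written, immutable index files; nothing is replayed here.
        foreach (var file in persistedIndexFiles)
            _persistedTables.Add(LoadPersistedTable(file));

        // 2) Replay only the log records written since the last persisted
        //    index (at most ~1m by default before a new table is flushed).
        foreach (var entry in recordsAfterCheckpoint)
            _memTable.Add(entry);
    }

    static IReadOnlyList<IndexEntry> LoadPersistedTable(string path)
    {
        // Placeholder: a real implementation reads the on-disk index format.
        return new List<IndexEntry>();
    }
}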

What disks are you running on that it takes 30-40 seconds to read 256MB from disk? That is very, very slow (roughly 6-8 MB/s). A verify is just read -> md5 checksum, so the time is normally dominated by IO. https://github.com/EventStore/EventStore/blob/dev/src/EventStore.Core/TransactionLog/Chunks/TFChunk/TFChunk.cs#L375
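
What that verify amounts to, roughly, is streaming the chunk through MD5 and comparing the result with the stored hash. A simplified sketch (not the exact TFChunk code, and ignoring the chunk header/footer layout) shows why the cost is almost entirely disk reads:

using System;
using System.IO;
using System.Security.Cryptography;

static class ChunkVerifySketch
{
    public static bool VerifyHash(string chunkPath, byte[] expectedHash)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(chunkPath))
        {
            var buffer = new byte[64 * 1024];
            int read;
            // Stream the whole file through MD5; the time spent here is I/O-bound.
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                md5.TransformBlock(buffer, 0, read, null, 0);
            md5.TransformFinalBlock(Array.Empty<byte>(), 0, 0);

            var actual = md5.Hash;
            if (actual == null || actual.Length != expectedHash.Length)
                return false;
            for (int i = 0; i < actual.Length; i++)
                if (actual[i] != expectedHash[i])
                    return false;
            return true;
        }
    }
}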

Greg

Thank you for the explanation. It seems like we have a performance issue on our side.

So we can better look at this in the future, what is the setup for your disks?

The server I've tested on is a virtualized server connected to a SAN with commodity disks. We tried increasing the server resources and already saw significantly better performance.