Every start needs an index rebuild

Every time I start ES, it needs to perform a lengthy index rebuild.
I start it, wait for the rebuild to finish (which is when the web UI finally responds to logging in), and then stop it via the web UI.
On the next start, it has to rebuild everything again.

This freaks me out and kills any potential chance of using ES in production. We can’t afford to stand still in multiple plants for hours at a time.

127.0.0.1-2113-cluster-node.log (127 KB)

On restart, ES rebuilds its memory table. The size of the memory table
is configurable; it defaults to 1,000,000 entries.

This process happens on my laptop in about two seconds. This leads me
to believe you are either writing massive events or you are running on
quite interesting hardware.

Greg

Win7 Ultimate, i7-3770k, 16.0 GB, SSD as system drive, 7200 rpm HDD for data (and ES executable). There shouldn't be anything unusual about this setup.
However, I did restore a backup where I had to delete a new chunk.

It's a forward (sequential) read. Why it would be taking hours on your
machine is likely environmental (others do not see this issue; as I
mentioned, it takes about 2 seconds on my laptop). I would also look at
things like anti-virus programs etc. As I see you have 22 chunks, even
reading all 22 chunks forward (5.5 GB) should take less time than an
hour. Backups/restores have nothing to do with this.

I tried disabling the anti-virus, but that had no observable effect. I can’t really uninstall it just to test it.
I’m seeing that ES reads two or three chunks at a time, but only very slowly (~1-2 MB/s). I can copy files off that disk just fine at >150 MB/s so I would rule out issues with the drive itself.
I even copied the same files that ES reads at well beyond 30 MB/s (while ES was reading them at <1MB/s).

I tried killing almost everything running on my machine, and it still had no effect.
I can try to set up a second instance of ES and fill it with a couple of million events to see how that instance behaves on start/stop.

ES seems to rebuild the entire index, not just 1,000,000 entries?

ES rebuilds up to 1M entries, unless it says it's doing a full index
rebuild due to corrupt files on disk etc.

If you read the logs, it even tells you how many entries were rebuilt:

[PID:05908:017 2016.05.09 07:03:25.103 DEBUG IndexCommitter ]
ReadIndex rebuilding done: total processed 866754 records, time
elapsed: 01:03:01.9556591.

To test, I just created a db with about 950k records in it. I used
testclient and did a wrfl 10 1000000 (stopped it just short of 1M). I
then restarted the node.

[22295,10,12:07:30.278] CACHED TFChunk #0-0 (chunk-000000.000000) in
00:00:00.8777061.
[22295,16,12:07:30.332] ReadIndex Rebuilding: processed 100000 records (10.9%).
[22295,16,12:07:30.778] ReadIndex Rebuilding: processed 200000 records (21.9%).
[22295,16,12:07:31.277] ReadIndex Rebuilding: processed 300000 records (32.8%).
[22295,16,12:07:31.746] ReadIndex Rebuilding: processed 400000 records (43.7%).
[22295,16,12:07:32.257] ReadIndex Rebuilding: processed 500000 records (54.6%).
[22295,16,12:07:32.780] ReadIndex Rebuilding: processed 600000 records (65.6%).
[22295,16,12:07:33.270] ReadIndex Rebuilding: processed 700000 records (76.5%).
[22295,16,12:07:33.782] ReadIndex Rebuilding: processed 800000 records (87.4%).
[22295,16,12:07:34.278] ReadIndex Rebuilding: processed 900000 records (98.3%).
[22295,16,12:07:34.356] ReadIndex rebuilding done: total processed
915193 records, time elapsed: 00:00:04.7460410.

This is on battery power, so it's a bit slower than when plugged in. As
I mentioned previously, there are two possible differences: the first
is the size of the events (e.g. if all my events were 500 KB, then
reading 1M of them is far more work than if they are 500 bytes); the
second is environmental.
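
As a rough back-of-the-envelope illustration of the first point (a sketch with assumed numbers, not a measurement; the read speed is an arbitrary figure picked for a spindle disk):

```python
# Data volume for rebuilding a 1M-entry memtable at two event sizes, and
# roughly how long that volume takes at an assumed sequential read speed.
records = 1_000_000
read_speed_mb_s = 100  # assumed sequential throughput, purely illustrative

for event_size_bytes in (500, 500 * 1024):
    total_mb = records * event_size_bytes / (1024 * 1024)
    print(f"{event_size_bytes:>7} B/event -> {total_mb:>9.0f} MB, "
          f"~{total_mb / read_speed_mb_s:.0f} s at {read_speed_mb_s} MB/s")
```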

You can also control the size of the memtable using --max-mem-table-size

Cheers,

Greg

I'm seeing a similar effect with a fresh DB and 6 million small events (the content is just "some data for #eventnumber"): starting ES takes a comparatively long time, but there is no index rebuild happening.

127.0.0.1-2114-cluster-node.log (66.4 KB)

You might want to try your test again.

[PID:04648:016 2016.05.09 12:22:03.629 INFO TcpServerListener ]
Starting Normal TCP listening on TCP endpoint: 127.0.0.1:1114.
[PID:04648:017 2016.05.09 12:22:03.657 INFO IndexCommitter ]
TableIndex initialization...
[PID:04648:017 2016.05.09 12:22:03.657 INFO IndexCommitter ]
ReadIndex building...
[PID:04648:017 2016.05.09 12:22:03.657 DEBUG IndexCommitter ]
ReadIndex rebuilding done: total processed 0 records, time elapsed:
00:00:00.0010001.

The reason that it did not process any events is:

[PID:04648:001 2016.05.09 12:22:03.279 INFO ProgramBase`1 ]
DATABASE: D:\Temp\eventstore\testdb2114.db
[PID:04648:001 2016.05.09 12:22:03.279 INFO ProgramBase`1 ]
WRITER CHECKPOINT: 0 (0x0)
[PID:04648:001 2016.05.09 12:22:03.279 INFO ProgramBase`1 ]
CHASER CHECKPOINT: 0 (0x0)

There are no events to process.

Cheers,

Greg

Reading further (I didn't realize there were multiple logs in one file).

[PID:07880:017 2016.05.09 12:43:06.529 INFO IndexCommitter ]
ReadIndex building...
[PID:07880:017 2016.05.09 12:43:06.554 DEBUG IndexCommitter ]
ReadIndex rebuilding done: total processed 41 records, time elapsed:
00:00:00.0210042.

On every restart you are getting ReadIndex rebuilds.

Try putting in 950k records (remember, at 1M intervals the index is
dumped to disk, so if you stop at a 1M boundary in your test the
rebuild will have very little work to do).

Which isn't surprising, considering that a new database is empty - I created a completely new database for this and started filling it.
The last start was with the 6 million events already written, but it only rebuilt an index of 41 records?

[PID:07880:017 2016.05.09 12:41:29.042 TRACE PTable ] Loading PTable ‘722a04e7-93f9-4c7f-98c5-ff88397e32ba’ (2000000 entries, cache depth 16) done in 00:00:27.4367350.
[PID:07880:017 2016.05.09 12:41:29.042 TRACE PTable ] Loading and Verification of PTable ‘c9458667-63a8-43ad-a16a-9b6c8f6c8a8f’ started…
[PID:07880:017 2016.05.09 12:43:06.529 TRACE PTable ] Loading PTable ‘c9458667-63a8-43ad-a16a-9b6c8f6c8a8f’ (4000000 entries, cache depth 16) done in 00:01:37.4700003.
[PID:07880:017 2016.05.09 12:43:06.529 INFO IndexCommitter ] ReadIndex building…
[PID:07880:017 2016.05.09 12:43:06.554 DEBUG IndexCommitter ] ReadIndex rebuilding done: total processed 41 records, time elapsed: 00:00:00.0210042.

You seem to have some misunderstandings about what happens when the
read index is rebuilt.

There are two parts to the read index: memtables and ptables. Ptables
are on disk; memtables are in memory. When a memtable is filled
(--max-mem-table-size), it is written down to disk as a ptable. On
start-up we rebuild the current memtable, which requires reading *up
to* --max-mem-table-size - 1 records (it can actually be slightly more
under load, as the ptable writing is asynchronous). In your case you
wrote 6M events, and ES internally writes some more (users/statistics
etc.). As such, you only have 41 records to load for your memtable
(6,000,041 % 1,000,000 (the default memtable size) = 41). If you were
to write 950k events (just under --max-mem-table-size), then you would
get about 950k read.
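
To make that arithmetic concrete, here is a minimal sketch (plain Python for illustration, not EventStore code) of how many records the start-up rebuild has to re-read:

```python
# Records the start-up memtable rebuild must re-read: everything written
# since the last full memtable was flushed to disk as a ptable.
def records_to_rebuild(total_records, max_mem_table_size=1_000_000):
    return total_records % max_mem_table_size

# 6M user events plus the records ES wrote internally (users, statistics, ...)
print(records_to_rebuild(6_000_041))  # 41     -> near-instant rebuild
print(records_to_rebuild(950_000))    # 950000 -> almost a full memtable to re-read
```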

Cheers,

Greg

Okay, that still doesn't explain why it's so unbearably slow to load.
Actually, I think there is an overall bottleneck somewhere - I don't see any reads or writes beyond 1-2 MB/s (except an initial and very brief spike when starting ES).

So the question of why there is a "rebuild" on every start is answered - but not why reading anything is so slow.

Funny thing is, I just observed another spike in throughput as ES was rebuilding the read index (I added 1.5 million additional events to the store).

[PID:09740:017 2016.05.09 13:22:52.250 TRACE PTable ] Loading PTable ‘04855808-dac9-4d87-9cf6-170a0c0b2c23’ (1000000 entries, cache depth 16) done in 00:00:26.1696476.
[PID:09740:017 2016.05.09 13:22:52.250 TRACE PTable ] Loading and Verification of PTable ‘722a04e7-93f9-4c7f-98c5-ff88397e32ba’ started…
[PID:09740:017 2016.05.09 13:23:54.356 TRACE PTable ] Loading PTable ‘722a04e7-93f9-4c7f-98c5-ff88397e32ba’ (2000000 entries, cache depth 16) done in 00:01:02.1012923.
[PID:09740:017 2016.05.09 13:23:54.356 TRACE PTable ] Loading and Verification of PTable ‘c9458667-63a8-43ad-a16a-9b6c8f6c8a8f’ started…
[PID:09740:017 2016.05.09 13:25:45.819 TRACE PTable ] Loading PTable ‘c9458667-63a8-43ad-a16a-9b6c8f6c8a8f’ (4000000 entries, cache depth 16) done in 00:01:51.4569070.
[PID:09740:017 2016.05.09 13:25:45.820 INFO IndexCommitter ] ReadIndex building…
[PID:09740:017 2016.05.09 13:25:46.192 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 100000 records (20,2%).
[PID:09740:017 2016.05.09 13:25:46.558 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 200000 records (40,2%).
[PID:09740:017 2016.05.09 13:25:46.904 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 300000 records (60,1%).
[PID:09740:017 2016.05.09 13:25:47.266 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 400000 records (80,1%).
[PID:09740:012 2016.05.09 13:25:47.452 TRACE TFChunk ] Verifying hash for TFChunk ‘D:\Temp\eventstore\testdb2114.db\chunk-000001.000000’…
[PID:09740:017 2016.05.09 13:25:47.665 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 500000 records (100,0%).
[PID:09740:017 2016.05.09 13:25:47.665 DEBUG IndexCommitter ] ReadIndex rebuilding done: total processed 500101 records, time elapsed: 00:00:01.8401840.

PTable loads are literally just a forward read of a file. Those are
small files as well.

[PID:09740:017 2016.05.09 13:23:54.356 TRACE PTable ]
Loading and Verification of PTable
'c9458667-63a8-43ad-a16a-9b6c8f6c8a8f' started...
[PID:09740:017 2016.05.09 13:25:45.819 TRACE PTable ]
Loading PTable 'c9458667-63a8-43ad-a16a-9b6c8f6c8a8f' (4000000
entries, cache depth 16) done in 00:01:51.4569070.

As an example, if you look in the db directory under /index, that file
will be there (and I'm guessing it's not very large).

Do you get the same behaviour if you run with the option
--skip-db-verify? I'm wondering if loading the indexes and verifying
the chunks at the same time might be causing a large number of seeks
on the disk.

[PID:09740:017 2016.05.09 13:25:45.820 INFO IndexCommitter ]
ReadIndex building...
[PID:09740:017 2016.05.09 13:25:46.192 DEBUG IndexCommitter ]
ReadIndex Rebuilding: processed 100000 records (20,2%).
[PID:09740:017 2016.05.09 13:25:46.558 DEBUG IndexCommitter ]
ReadIndex Rebuilding: processed 200000 records (40,2%).
[PID:09740:017 2016.05.09 13:25:46.904 DEBUG IndexCommitter ]
ReadIndex Rebuilding: processed 300000 records (60,1%).
[PID:09740:017 2016.05.09 13:25:47.266 DEBUG IndexCommitter ]
ReadIndex Rebuilding: processed 400000 records (80,1%).
[PID:09740:012 2016.05.09 13:25:47.452 TRACE TFChunk ]
Verifying hash for TFChunk
'D:\Temp\eventstore\testdb2114.db\chunk-000001.000000'...
[PID:09740:017 2016.05.09 13:25:47.665 DEBUG IndexCommitter ]
ReadIndex Rebuilding: processed 500000 records (100,0%).

Is about what I would expect to see on most systems.
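
If you want to test the seek hypothesis in isolation, a minimal sketch along these lines might help (plain Python, nothing EventStore-specific; the file names are placeholders, and the files should be larger than RAM, or the OS cache dropped, for the numbers to mean anything on a spindle disk). It compares reading two files one after the other against reading them interleaved:

```python
# Compare sequential reads of two files against interleaved reads that force
# the disk head to seek back and forth between them.
import os
import time

CHUNK = 1024 * 1024      # 1 MB read blocks
FILE_SIZE = 256 * CHUNK  # 256 MB per test file; use more than RAM for real results

def make_file(path):
    # Create a test file of FILE_SIZE bytes if it does not exist yet.
    if not os.path.exists(path):
        block = os.urandom(CHUNK)
        with open(path, "wb") as f:
            for _ in range(FILE_SIZE // CHUNK):
                f.write(block)

def read_sequential(path_a, path_b):
    # Read one file completely, then the other.
    for path in (path_a, path_b):
        with open(path, "rb") as f:
            while f.read(CHUNK):
                pass

def read_interleaved(path_a, path_b):
    # Alternate 1 MB reads between the two files.
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            da, db = a.read(CHUNK), b.read(CHUNK)
            if not da and not db:
                break

def timed(label, fn):
    start = time.time()
    fn("a.bin", "b.bin")
    total_mb = 2 * FILE_SIZE / (1024 * 1024)
    print(f"{label}: {total_mb / (time.time() - start):.1f} MB/s")

make_file("a.bin")
make_file("b.bin")
timed("sequential ", read_sequential)
timed("interleaved", read_interleaved)
```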

What happens when you try this on a fresh laptop with no AV having ever been installed?

--skip-db-verify has a significant impact. ES now starts in only a few seconds on the new database:
[PID:09180:017 2016.05.09 13:42:55.655 TRACE PTable ] Loading PTable ‘04855808-dac9-4d87-9cf6-170a0c0b2c23’ (1000000 entries, cache depth 16) done in 00:00:00.6130039.
[PID:09180:017 2016.05.09 13:42:55.655 TRACE PTable ] Loading and Verification of PTable ‘722a04e7-93f9-4c7f-98c5-ff88397e32ba’ started…
[PID:09180:017 2016.05.09 13:42:56.507 TRACE PTable ] Loading PTable ‘722a04e7-93f9-4c7f-98c5-ff88397e32ba’ (2000000 entries, cache depth 16) done in 00:00:00.8509992.
[PID:09180:017 2016.05.09 13:42:56.507 TRACE PTable ] Loading and Verification of PTable ‘c9458667-63a8-43ad-a16a-9b6c8f6c8a8f’ started…
[PID:09180:011 2016.05.09 13:42:56.678 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [MonitoringQueue]: SystemInit - 1681ms. Q: 0/0.
[PID:09180:017 2016.05.09 13:42:58.424 TRACE PTable ] Loading PTable ‘c9458667-63a8-43ad-a16a-9b6c8f6c8a8f’ (4000000 entries, cache depth 16) done in 00:00:01.9170225.
[PID:09180:017 2016.05.09 13:42:58.424 INFO IndexCommitter ] ReadIndex building…
[PID:09180:017 2016.05.09 13:42:58.757 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 100000 records (20,1%).
[PID:09180:017 2016.05.09 13:42:59.096 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 200000 records (39,9%).
[PID:09180:017 2016.05.09 13:42:59.472 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 300000 records (59,8%).
[PID:09180:017 2016.05.09 13:42:59.829 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 400000 records (79,7%).
[PID:09180:017 2016.05.09 13:43:00.161 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 500000 records (99,4%).
[PID:09180:017 2016.05.09 13:43:00.161 DEBUG IndexCommitter ] ReadIndex rebuilding done: total processed 500135 records, time elapsed: 00:00:01.7321732.
On the previous database this also has a significant effect, though startup still takes a few minutes. The events added last there are slightly bigger (~4 KB).
Still, it is a major difference whether startup takes 3 minutes or over 1 hour.
[PID:04596:017 2016.05.09 13:46:37.061 TRACE PTable ] Loading PTable ‘f010d2d6-8fbc-4b70-9fe5-cd943d622996’ (8000000 entries, cache depth 16) done in 00:01:00.2598900.
[PID:04596:017 2016.05.09 13:46:37.065 INFO IndexCommitter ] ReadIndex building…
[PID:04596:017 2016.05.09 13:46:42.065 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 23874 records (2,8%).

[PID:04596:017 2016.05.09 13:48:32.834 DEBUG IndexCommitter ] ReadIndex Rebuilding: processed 800000 records (92,3%).
[PID:04596:017 2016.05.09 13:48:33.112 DEBUG IndexCommitter ] ReadIndex rebuilding done: total processed 866788 records, time elapsed: 00:01:56.0473639.
Compared to the previous start:
[PID:05908:017 2016.05.09 07:03:25.103 DEBUG IndexCommitter ] ReadIndex rebuilding done: total processed 866754 records, time elapsed: 01:03:01.9556591.

So what the db verify does is calculate MD5 hashes of all of the
chunk files and verify them against precalculated MD5 hashes.
Generally you can run without this step (it's rare that it ever
fails). My guess is that, with your spindle HDD, this is causing a
large number of seeks between the files.
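
For illustration, the verify pass amounts to something like the following sketch (a hypothetical stand-in, not the actual EventStore implementation; it just shows that verification has to stream every byte of every chunk):

```python
# Hash a chunk file with MD5; the verify pass has to read the whole file.
import hashlib

def md5_of_file(path, block_size=1024 * 1024):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

# Hypothetical usage: compare against the precalculated hash stored for the chunk.
# if md5_of_file("chunk-000000.000000") != stored_hash:
#     raise RuntimeError("chunk failed verification")
```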

I'm not that comfortable disabling features that are supposed to help detect data integrity issues - especially in our potential use case (deployment to customer-managed servers).
Are you assuming that everybody runs ES on SSDs?

We could do the verification later, but then there would be the same
problem: it would conflict with the operations of the running db (the
same seeking problem).

The files are all immutable on disk. The chances of getting a failure
are very small; normally it would only happen on a failing disk. In a
production environment with spindle disks you are likely also running
on a RAIDed drive with error correction, making it even less likely to
happen. Beyond this, many file systems already do such checks, so
doing it twice is redundant.

Looking at it further, this issue would not exist in a RAIDed
environment, as you can do multiple concurrent reads without the seeks.

Do you intend to run your production environment on single spindle disks?

Cheers,

Greg

I'm not sure how much influence we have on our customers' server choices (we're considering integrating ES under an OEM license - we're already in contact regarding this).
We can certainly make recommendations to our customers, but I'm not sure if they'll stick to them - at least we'll know about it when it comes up.

Thank you for your help, I now have some more material to work with.
Maybe I’ll get around to testing this on a spindle disk RAID.

Regards,
Thomas

As a side note, yes, we do recommend running on SSDs if possible. This
is especially true for large dbs (even if you only put the index on
the SSD).