v5 migration: $by_correlation_id projection stuck

Hello,

I am trying to migrate to v5.0.2 (from a 4.1.1 backup). Everything looks great at first: the $by_correlation_id projection starts.

But at 83.8% the projection gets stuck, while still being marked as running, without any error (in the UI).

In the log, when it reaches this point, I notice:

  • lots of VERY SLOW QUEUE MSG on ReadAllEventsForward (with 17-18s response times)
  • lots of "Read All Stream Events Forward operation has expired" at the position where the projection is blocked
  • lots of "Read Stream Events Forward operation has expired" on $users-password-notifications
  • lots of "Read Stream Events Backward operation has expired" on $scavenges
  • no client connections to the ES node, so it can only be internal processing (the ReadAll makes me think it is the projection processing)
  • the UI becomes mostly unresponsive.

If I stop the $by_correlation_id projection, there are no more VERY SLOW QUEUE MSG entries and everything comes back to normal.

I tried it on several computers (the best one has 12 logical cores, 16 GB RAM and an SSD), several times, and each time the projection gets stuck at the same point (C:129373076598/P:129372351362). Note I also tried rebuilding the index; it just shifted the stuck position slightly (C:129373076532/P:129372352150).

Here is an extract of the logs (repeating):

[25940,33,07:59:00.629] Read Stream Events Forward operation has expired for Stream: "$users-password-notifications", From Event Number: 5, Max Count: 20. Operation Expired at 28/08/2019 07:58:56

[25940,33,07:59:00.629] Read All Stream Events Forward operation has expired for C:129373076532/P:129372352150. Operation Expired at 28/08/2019 07:58:58

[25940,06,07:59:12.635] Timeout reading stream: "$users-password-notifications". Trying again in 10 seconds.

[25940,32,07:59:13.466] SLOW QUEUE MSG ["StorageReaderQueue #3"]: "ReadAllEventsForward" - 15072ms. Q: 0/3.

[25940,32,07:59:13.466] ---!!! VERY SLOW QUEUE MSG ["StorageReaderQueue #3"]: "ReadAllEventsForward" - 15072ms. Q: 0/3.

[25940,32,07:59:13.466] Read Stream Events Forward operation has expired for Stream: "$users-password-notifications", From Event Number: 5, Max Count: 100. Operation Expired at 28/08/2019 07:59:12

[25940,32,07:59:30.345] SLOW QUEUE MSG ["StorageReaderQueue #3"]: "ReadAllEventsForward" - 16879ms. Q: 0/5.

[25940,32,07:59:30.345] ---!!! VERY SLOW QUEUE MSG ["StorageReaderQueue #3"]: "ReadAllEventsForward" - 16879ms. Q: 0/5.

[25940,32,07:59:30.345] Read Stream Events Backward operation has expired for Stream: "$scavenges", From Event Number: -1, Max Count: 1. Operation Expired at 28/08/2019 07:59:23

[25940,32,07:59:30.345] Read All Stream Events Forward operation has expired for C:154482221397/P:0. Operation Expired at 28/08/2019 07:59:26

Does anyone have an idea what is happening? Could support help with this (we would have to subscribe… and time is passing)?

Regards

Clément

Hello Clément,

Thanks for all the details.

Would you be willing to share your database with us for easier reproduction of the issue and root cause analysis?

If you have a support account, please feel free to open a ticket and we’ll follow up from there.

Thanks!

Shaan from EventStore

Hello Shaan,

Thanks for your quick answer. I have to check with my company whether/how we can share the db. I think I will try to speed up the process of subscribing to the commercial version ;).

Regards

Clément

Hi,

In the end, we can't share the db (confidential data). We are also trying to acquire the commercial licence (some internal complications…).

We managed to identify more precisely where the issue is, but we still have no idea why it occurs; perhaps someone here has an idea.

In fact, the issue appears at a given position when reading the $all stream (ReadAllEventsForward). So it shows up when resetting any system projection, or even when reading the $all stream directly in the UI. It also appears with v4.1.1 (currently used in production). We can reproduce it with a ReadAllEventsForward at the incriminated commit/prepare position via the client API, as sketched below.
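
For reference, here is a minimal reproduction sketch along those lines, using the EventStore.ClientAPI .NET TCP client (the endpoint, the credentials and the single-node setup are assumptions to adapt; the commit/prepare position is the one from our logs above):

    // Reads one forward slice of $all starting at the position where the projection
    // gets stuck, and prints how long the read takes. Assumes a single node on
    // localhost:1113 with the default admin credentials.
    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;
    using EventStore.ClientAPI;
    using EventStore.ClientAPI.SystemData;

    class ReadAllRepro
    {
        static async Task Main()
        {
            var conn = EventStoreConnection.Create("ConnectTo=tcp://localhost:1113");
            await conn.ConnectAsync();

            // Commit/prepare position where the projection is stuck (taken from the logs above).
            var position = new Position(129373076532, 129372352150);
            var credentials = new UserCredentials("admin", "changeit");

            var sw = Stopwatch.StartNew();
            // Same kind of read the projection core performs: one forward slice of $all
            // (250 events, no link resolution).
            AllEventsSlice slice = await conn.ReadAllEventsForwardAsync(position, 250, false, credentials);
            sw.Stop();

            Console.WriteLine($"Read {slice.Events.Length} events in {sw.ElapsedMilliseconds} ms, next position: {slice.NextPosition}");
            conn.Close();
        }
    }

Running this against a copy of the database shows the slow read directly, without involving the projections at all.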

The slice size used by the system projections is 250, with a 10s read timeout, and it times out on some of our events. These events are very small, and we are running on SSD. There is no issue when reading the start or the end of the $all stream, for example, even where the events are much bigger (just a few ms for 250 events). We managed to unblock the system projection by setting a higher read timeout (hard coded). It takes about 8s to read 100 events around the incriminated commit/prepare position; a sketch for locating such slow regions follows below.
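
Here is a rough sketch of how one can walk the $all stream slice by slice to locate slow regions like this one (same endpoint/credentials assumptions as above; the 1 second threshold is arbitrary):

    // Walks $all forward slice by slice and logs any slice that takes unusually long to read.
    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;
    using EventStore.ClientAPI;
    using EventStore.ClientAPI.SystemData;

    class FindSlowSlices
    {
        static async Task Main()
        {
            var conn = EventStoreConnection.Create("ConnectTo=tcp://localhost:1113");
            await conn.ConnectAsync();
            var credentials = new UserCredentials("admin", "changeit");

            var position = Position.Start;
            while (true)
            {
                var sw = Stopwatch.StartNew();
                // 250 events per slice, the same slice size the system projections use.
                var slice = await conn.ReadAllEventsForwardAsync(position, 250, false, credentials);
                sw.Stop();

                if (sw.ElapsedMilliseconds > 1000)
                    Console.WriteLine($"Slow slice at {position}: {sw.ElapsedMilliseconds} ms");

                if (slice.IsEndOfStream) break;
                position = slice.NextPosition;
            }
            conn.Close();
        }
    }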

We dug a bit more into the EventStore source code and found that the time is wasted in ~3000 loop iterations per event, searching for the Prepare RecordType at the correct TransactionPosition (EventStore.Core.Services.Storage.ReaderIndex.AllReader, method ReadAllEventsForward, in the LogRecordType.Commit case of the switch). So for 100 events it takes up to 8s (only in the incriminated slice), versus 20ms at other positions, where only LogRecordType.Prepare records are found.
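
To make that concrete, here is a small self-contained toy model of the scan pattern as we understand it (illustration only: the record types, the log layout and the numbers are made up, and this is not the actual EventStore.Core code). The idea is that when the Commit record of an explicit transaction is resolved, each event lookup walks forward from the commit's TransactionPosition past every unrelated record interleaved before the transaction's prepares:

    // Illustration only: toy types and a toy in-memory "log", not the actual EventStore code.
    using System;
    using System.Collections.Generic;

    enum RecType { Prepare, Commit, Other }

    class Rec
    {
        public long Pos;            // position of the record in the toy log
        public RecType Type;
        public long TxPos = -1;     // transaction position for Prepare/Commit records
        public int EventIndex = -1; // index of the event inside its transaction
    }

    class CommitScanModel
    {
        static void Main()
        {
            const int Events = 100;       // events in the explicit transaction
            const int Interleaved = 3000; // unrelated records written before its prepares

            // Assumed layout around the incriminated position: the transaction starts,
            // ~3000 unrelated records are written by other streams, then the transaction's
            // prepares are written, then its Commit.
            var log = new List<Rec> { new Rec { Pos = 0, Type = RecType.Prepare, TxPos = 0 } };
            for (int i = 0; i < Interleaved; i++)
                log.Add(new Rec { Pos = log.Count, Type = RecType.Other });
            for (int i = 0; i < Events; i++)
                log.Add(new Rec { Pos = log.Count, Type = RecType.Prepare, TxPos = 0, EventIndex = i });
            var commit = new Rec { Pos = log.Count, Type = RecType.Commit, TxPos = 0 };
            log.Add(commit);

            // Resolving each event of the commit scans forward from TxPos until the matching
            // Prepare is found, skipping every unrelated record in between.
            long iterations = 0;
            for (int eventIndex = 0; eventIndex < Events; eventIndex++)
            {
                for (long pos = commit.TxPos; pos < commit.Pos; pos++)
                {
                    iterations++;
                    var rec = log[(int)pos];
                    if (rec.Type == RecType.Prepare && rec.TxPos == commit.TxPos && rec.EventIndex == eventIndex)
                        break;
                }
            }

            Console.WriteLine($"{iterations} iterations to resolve {Events} events (~{iterations / Events} per event).");
        }
    }

With that layout the model scans roughly 3000 records per event, i.e. around 300,000 record reads for a 100-event slice, which would explain 8s for one slice versus 20ms for slices containing only plain Prepare records.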

Thanks for your feedback. Any help/tips appreciated.

Clément