Catch-up subscription stalled: after a while, we don't receive any events

Thomas_Pierrain · April 5, 2018, 7:45am

Hi,

We recently faced some issues with catch-up subscriptions: our read projection
component stopped receiving events from our single ES instance (no cluster
involved here) when it should have kept receiving them, so our read model went
stale. We have put logs everywhere, and no disconnection or unsubscription has
been detected so far.

We are still investigating the issue (it only appears once in a while, after
our subscriber component has been running for a long time), but we are
wondering whether this problem rings a bell for any of you (or whether you
would suggest looking at any particular monitoring data).

For the record, we are subscribing to system projection streams such as
$ce-fund, $ce-share or $ce-umbrella, where fund, share and umbrella are the
names of some of our aggregates, with original streams like
fund-01b35029-c13f-4e93-8379-9befe4bbdd9f and
umbrella-eff71460-d77c-4c22-aad7-c395aafca996.

Our Event Store is running on Windows, and we use version 4.0.3.

Thomas

Greg_Young1 · April 5, 2018, 7:55am

What is in your client logs? You can enable them when creating the connection.

Thomas_Pierrain · April 5, 2018, 8:40am

Thanks for your answer. Unfortunately we didn’t have the client logs enabled in production ;-(

We have now enabled them so that we have more information to troubleshoot with when the bug reproduces.

We’ll get back to you with more info.

Thomas_Pierrain · April 6, 2018, 8:47am

Hi,

After more investigation yesterday, it seems that our “ReadProjections” service (which subscribes to ES to update our read model) had an incorrect log-level configuration. As a consequence, we had no log entries for any connections or disconnections related to catch-up subscriptions, which was unexpected ;-( To put it differently, it is very likely that we had disconnections in PROD but have no proof of them.

Worse, due to a lame refactoring (on our side), our connection setting .KeepReconnecting(); had disappeared, so after any disconnection our service tried to reconnect only 10 times (about 2 seconds in total). If it couldn’t reconnect within that time frame, our service stayed up but remained disconnected from our Event Store. Sad panda…

After a QA campaign to validate our assumptions, we fixed the situation by setting appropriate log levels in our app configuration, enabling ES verbose logs and restoring our KeepReconnecting() setting in order to improve our platform's robustness.
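
For anyone hitting the same symptom, here is a rough sketch of what such a connection setup can look like with the .NET client (EventStore.ClientAPI). It is not our actual code: the class name, the URI and the connection name are placeholders, and the console logger is only one possible choice. The point is simply that KeepReconnecting() and the client logging are both set when the connection is created, and that the connection-level events are logged so a disconnection cannot go unnoticed.

```csharp
using System;
using EventStore.ClientAPI;

// Hypothetical helper: names, URI and logger choice are assumptions for illustration.
public static class ReadProjectionsConnection
{
    public static IEventStoreConnection Create()
    {
        var settings = ConnectionSettings.Create()
            .KeepReconnecting()       // keep trying to reconnect instead of giving up after a limited number of attempts
            .UseConsoleLogger()       // route the client's own log output somewhere visible
            .EnableVerboseLogging()   // include detailed connection-level messages
            .Build();

        var connection = EventStoreConnection.Create(
            settings,
            new Uri("tcp://localhost:1113"),   // placeholder address
            "read-projections");               // placeholder connection name

        // Log connection state changes from the application side as well.
        connection.Disconnected += (sender, e) =>
            Console.WriteLine("Disconnected from {0}", e.RemoteEndPoint);
        connection.Reconnecting += (sender, e) =>
            Console.WriteLine("Reconnecting...");
        connection.Closed += (sender, e) =>
            Console.WriteLine("Connection closed: {0}", e.Reason);

        return connection;
    }
}
```

The caller still has to call ConnectAsync() on the returned connection before subscribing. If the Closed event ever fires, the client has given up for good, which is the state our service silently ended up in.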

Following Occam’s razor, we have no reason to suspect any issue in ES itself so far. Sorry for the inconvenience.