Problems with stable subscriptions when subscriptions drop or connection is reconnected

So I’ve found the ShouldStop flag in the code… and think I might have worked it out. The SubscriptionDropReason.CatchUpError and SubscriptionDropReason.UserInitiated reasons both cause the RunSubscription method to exit before the OnReconnect event is attached.

Not sure where the other enum values come from… :slight_smile:

Tom

https://github.com/EventStore/EventStore/blob/dev/src/EventStore.ClientAPI/ClientOperations/SubscriptionOperation.cs#L108
Here you go!

Cool, I’m having a hunt through the code now :slight_smile:

I guess there are 3 possible states:

  1. ES auto-reconnects (e.g. ConnectionClosed). Cool, just let it.

  2. Do not reconnect (e.g. AccessDenied). Bad things have happened, no point retrying.

  3. Manually handle it (e.g. CatchUpError). I’m wondering what the underlying reasons for this might be and how recoverable it is.

As a side question, does ES provide a way to monitor/query subscriptions? I.e., if I create 10 subscriptions, is there an API where I can query the state of each of them? I’m presuming not, and that we have to handle this in our application code by subscribing to the events.

Tom

The whole point of the subscription model for catch-up subscriptions is
that it's a client-driven subscription: the server has no subscription
state (only the client does). If you want such things, use persistent
subscriptions (competing consumers).

Ok, thought as much, I think we will end up moving to competing consumers once it’s released.

thanks for your help

tom

It's in dev and works. There is even a UI underway for it to show you
all the stuff you are asking about here :slight_smile:

Did you ever figure out which ones could resubscribe and which ones auto-retry?

As near as I can tell only a ConnectionClosed will auto-retry, and that’s due to an OnReconnecting hook.

https://github.com/EventStore/EventStore/blob/dev/src/EventStore.ClientAPI/EventStoreCatchUpSubscription.cs#L175

Seems like CatchUpError, SubscribingError, and ProcessingQueueOverflow (and maybe ServerError?) could resubscribe after a back-off. UserInitiated is just shutdown. And for the rest, a view update service might as well crash.

We’ll improve the docs on this. Originally the idea was that on a catch-up subscription everything except a user dropping the subscription by calling subscription.Stop() would automatically continue on reconnect, and that for volatile subscriptions nothing would continue on reconnect. That might not be the current state of the world though (I’d need to look through the code to investigate). This could probably use better test coverage as well.

"Originally the idea was that on a catch-up subscription everything
*except* a user dropping the subscription by calling
`subscription.Stop()`"

I know a handler error was also intended not to continue by default.

It's worth looking through (and documenting better).

Sounds good. I love the auto reconnect functionality. I just want to handle a drop properly.

I decided that for me, the only drop reason to safely retry on (after a backoff period) is ProcessingQueueOverflow.

Most of the other errors I will get on restart (or at least a view restart), and I would rather know that immediately than have it retry for a period. So I let them cause a crash. For example, if you start a subscription with a connection that has already been closed, it gets immediately dropped with CatchUpError, because it can’t read the old events.

ConnectionClosed is retried automatically, so I ignore this one.

UserInitiated, I also ignore, because I call subscription.Stop synchronously, so I don’t need/want to be notified through a back channel that it stopped.
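
For anyone who wants to see that policy written down, here is a rough, language-neutral sketch in Python. It is not the .NET ClientAPI: `DropReason` only mirrors the reason names mentioned in this thread, and `handle_drop`, `backoff_delays`, `FatalSubscriptionDrop` and the `resubscribe` callback are hypothetical names made up for illustration. The delay values are arbitrary assumptions too.

```python
import enum
import time
from typing import Callable, Iterable, Iterator, Optional


class DropReason(enum.Enum):
    # Names mirror the drop reasons mentioned in this thread (illustrative only).
    UserInitiated = enum.auto()
    ConnectionClosed = enum.auto()
    ProcessingQueueOverflow = enum.auto()
    CatchUpError = enum.auto()
    ServerError = enum.auto()
    EventHandlerException = enum.auto()
    NotAuthenticated = enum.auto()
    AccessDenied = enum.auto()
    Unknown = enum.auto()


class FatalSubscriptionDrop(RuntimeError):
    """Raised for drops this policy treats as crash-worthy (hypothetical type)."""


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 5) -> Iterator[float]:
    """Yield exponentially growing delays in seconds, capped at `cap`."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def handle_drop(reason: DropReason,
                resubscribe: Callable[[], bool],
                sleep: Callable[[float], None] = time.sleep,
                delays: Optional[Iterable[float]] = None) -> str:
    """Apply the policy from the post above to a single drop notification."""
    # The client retries a closed connection itself, and a user-initiated stop
    # was requested synchronously, so neither needs any action here.
    if reason in (DropReason.ConnectionClosed, DropReason.UserInitiated):
        return "ignored"
    # The only reason retried manually: wait, then try to resubscribe.
    if reason is DropReason.ProcessingQueueOverflow:
        for delay in (delays if delays is not None else backoff_delays()):
            sleep(delay)
            if resubscribe():
                return "resubscribed"
        raise FatalSubscriptionDrop("resubscribe attempts exhausted")
    # Everything else should surface immediately instead of being retried.
    raise FatalSubscriptionDrop(reason.name)
```

The `sleep` parameter is injectable only so the back-off can be exercised without real waiting. In a real service the `resubscribe` callback would create a new catch-up subscription from the last processed position.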

As it’s still really hard to figure out which drop reasons will and which won’t result in reconnects, is there some documentation for this by now?

On non-catchup subscriptions, I haven’t been able to find any resubscribe logic. On a catchup subscription, as far as I can tell only ConnectionClosed will resubscribe, and that’s only if the subscription has already started live processing. A failure before live processing started will have resulted in a CatchUpError or UserInitiated, and no resubscribe will be attempted.

https://github.com/EventStore/EventStore/blob/dev/src/EventStore.ClientAPI/EventStoreCatchUpSubscription.cs#L222
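
To make that reading of the code concrete, the condition can be written as a small predicate. This is only a sketch of my understanding, in Python rather than C#. The function name and its parameters are hypothetical and do not exist in the client.

```python
def will_auto_resubscribe(is_catch_up: bool,
                          drop_reason: str,
                          live_processing_started: bool) -> bool:
    """Return True when, per the reading above, the client resubscribes by itself.

    All three conditions must hold: the subscription is a catch-up subscription,
    the drop reason is ConnectionClosed, and live processing had already started.
    """
    return (is_catch_up
            and drop_reason == "ConnectionClosed"
            and live_processing_started)
```

Every other combination, including a ConnectionClosed on a non-catchup subscription, would have to be handled by the application.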

As far as manually resubscribing…

I only see one drop reason that is absolutely resubscribe-able: ProcessingQueueOverflow, after draining the queue.

CatchUpError might be resub-able. It happens when there is any kind of error while reading catchup events or trying to create a live subscription after catching up. The likely scenario is that the server can’t be reached when you start the catch-up subscription. It could also happen after an automatic resubscribe, but the server/network would have to be pretty unstable to successfully reconnect but fail on catchup. Either way, my choice is to not retry… it’s likely to happen during a maintenance window when starting a view update service, and I want ops to see it fail/stop immediately, not minutes later after retries are exhausted.

ServerError, it is doubtful that a resub will help. It happens when the server responds with a bad TCP command, or with an unknown response. Considering TCP is a “reliable” protocol, there’s likely a bug or the payload has exceeded some platform limit, and resending isn’t going to help.

Who knows if Unknown is retry-able. It happens when the client receives a response with an unexpected drop reason or if there is any kind of error processing the server’s response.

The other errors are terminal. Either:

  • subscription stopped manually, i.e. before exiting the app/service (UserInitiated)
  • security configuration problem (NotAuthenticated, AccessDenied)
  • event handling code is broken (EventHandlerException)

Oh, and SubscribingError doesn’t appear to be hooked up to anything. It’s called from an internal method, but I can’t find any place where that method is called.
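
Here is the same summary as a lookup table, in case that is easier to scan or to paste into a handler. It is a Python sketch of my reading only, not something the client exposes: the `Action` categories and both helper functions are names I made up, and the keys are just the drop reason names used above.

```python
import enum
from typing import List, Optional


class Action(enum.Enum):
    AUTOMATIC = "client resubscribes by itself"
    RESUBSCRIBE = "safe to resubscribe manually"
    UNCERTAIN = "a manual resubscribe may or may not help"
    TERMINAL = "do not resubscribe"


# One entry per drop reason discussed above. SubscribingError is left out on
# purpose because it does not appear to be raised anywhere.
DROP_REASON_ACTIONS = {
    # Only on catch-up subscriptions that have already started live processing.
    "ConnectionClosed": Action.AUTOMATIC,
    # Resubscribe after draining the queue.
    "ProcessingQueueOverflow": Action.RESUBSCRIBE,
    # Might work, but it usually means the server cannot be reached.
    "CatchUpError": Action.UNCERTAIN,
    # Doubtful: a bad TCP command or an unknown response.
    "ServerError": Action.UNCERTAIN,
    # Unexpected drop reason or an error while processing the response.
    "Unknown": Action.UNCERTAIN,
    "UserInitiated": Action.TERMINAL,
    "NotAuthenticated": Action.TERMINAL,
    "AccessDenied": Action.TERMINAL,
    "EventHandlerException": Action.TERMINAL,
}


def action_for(reason: str) -> Optional[Action]:
    """Look up the suggested action, or None for a reason not covered above."""
    return DROP_REASON_ACTIONS.get(reason)


def reasons_with(action: Action) -> List[str]:
    """List the drop reasons mapped to `action`, sorted by name."""
    return sorted(r for r, a in DROP_REASON_ACTIONS.items() if a is action)
```

For example, `reasons_with(Action.TERMINAL)` lists the four reasons from the bullets above, and `action_for("SubscribingError")` returns `None`.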

Let me know if I missed anything.