Uptick in 'Retrying message' logs (ES 5.0.1)

We’ve noticed a sudden and dramatic uptick in persistent subscription events being retried.

Retrying message {subscriptionId} {stream}/{eventNumber}

Over the last few days we’ve gone from almost none (in August, there were 13 across four days) to 5,000-10,000 per day.
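To narrow down where the retries come from, it may help to break the counts down by subscription and by event. The following is a rough sketch in Python, assuming each log line contains text in the shape of the template above (`Retrying message <subscriptionId> <stream>/<eventNumber>`) and that the IDs contain no spaces; `count_retries` is a hypothetical helper, not part of Event Store.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log template quoted above:
#   "... Retrying message <subscriptionId> <stream>/<eventNumber>"
RETRY_RE = re.compile(r"Retrying message (\S+) (\S+)/(\d+)")


def count_retries(lines):
    """Return (retries per subscription, retries per (stream, event number))."""
    by_subscription = Counter()
    by_event = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match is None:
            continue  # not a retry line
        subscription, stream, event_number = match.groups()
        by_subscription[subscription] += 1
        by_event[(stream, int(event_number))] += 1
    return by_subscription, by_event
```

If a handful of events account for most of the retries, that points at specific messages failing repeatedly; if the retries are spread evenly, slow handling in general looks more likely.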

Previous deploy was two weeks ago.
There is nothing in our application logs that suggests that anything else is going wrong.
No evidence of connections dropping.

Does anyone have any ideas what could be causing this?

Are the persistent subscriptions feeding a service for loading into a read model?
Pure guess here, but I suspect that whatever is consuming the events is either 1) failing or 2) taking longer than usual. The default retry is 10 seconds per event.

They’ve all got timeouts of 30 seconds.
They’re generally pretty simple: Hydrate model from events, call method and persist.
Perhaps ES is taking too long to do one of the operations?
How can I tell whether ES is at or close to its limit?
I’ve got logging and statsd metrics around fetches (length of stream and time taken), but neither appears to indicate that ES is saturated (e.g. only 30 seconds of each minute is spent reading events).

So your flow is this:

  1. Persist to model
  2. Event gets delivered via persistent Subscription to endpoint
  3. You rehydrate the same model, or a new model?

Inside the event subscriber, have you got any logging to indicate how long each event takes to process?
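For illustration, here is a minimal sketch of that kind of timing in Python. It assumes the handler is a plain function taking one event; `timed_handler` and `MESSAGE_TIMEOUT_SECONDS` are hypothetical names, and the 30 second value is only a placeholder for whatever the subscription's Message Timeout is set to.

```python
import functools
import logging
import time

logger = logging.getLogger("subscription.handler")

# Assumption: set this to the subscription's "Message Timeout (ms)" value.
MESSAGE_TIMEOUT_SECONDS = 30.0


def timed_handler(handler, timeout_seconds=MESSAGE_TIMEOUT_SECONDS,
                  clock=time.monotonic):
    """Wrap an event handler so that every call logs how long it took."""

    @functools.wraps(handler)
    def wrapper(event):
        started = clock()
        try:
            return handler(event)
        finally:
            # Runs on success and on failure, so failed events are timed too.
            elapsed = clock() - started
            level = logging.WARNING if elapsed >= timeout_seconds else logging.INFO
            logger.log(level, "handled event in %.3fs (timeout %.1fs)",
                       elapsed, timeout_seconds)
            wrapper.durations.append(elapsed)

    wrapper.durations = []  # kept in memory for quick inspection
    return wrapper
```

Any call logged at warning level took at least as long as the timeout, which is the case where a redelivery would be expected.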

What’s the in-flight count set to on your persistent subscription?

  3. It’s a different model.

There’s no logging in the handler

Buffer Size
Check Point After
Extra Statistics
Live Buffer Size
Max Checkpoint Count
Max Retry Count
Message Timeout (ms)
Min Checkpoint Count
Consumer Strategy
Read Batch Size
Resolve Link Tos
Start From Event

Looks fine to me.
Could you take one of the events and manually post it in (via Postman, etc.) to check how long it takes?

Do you mean create one of the events that a handler would be listening for?

Yeah. On our systems here, our handlers are sitting behind a REST interface, so we simply use Postman to post the event in.
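If Postman isn't to hand, the same check can be scripted. This is only a sketch, assuming the handler is reachable over HTTP and accepts the event as a JSON body; `time_post` is a hypothetical helper, and the URL and event shape are whatever your own endpoint expects.

```python
import json
import time
import urllib.request


def time_post(url, event, opener=urllib.request.urlopen, clock=time.monotonic):
    """POST one event as JSON and return (HTTP status, seconds elapsed)."""
    body = json.dumps(event).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    started = clock()
    with opener(request) as response:
        response.read()  # wait for the full response before stopping the clock
        status = response.status
    return status, clock() - started
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` on 4xx/5xx responses, so a failing handler shows up as an exception rather than a status code.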