Uptick in 'Retrying message' logs (ES 5.0.1)

Mike_Boehm · September 7, 2020, 5:59am

We’ve noticed a sudden and dramatic uptick in persistent subscription events being retried.

Retrying message {subscriptionId} {stream}/{eventNumber}

We’ve gone from almost none (13 across four days in August) to 5,000-10,000 per day over the last few days.
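One way to check whether retries like these are spread evenly or concentrated on a few subscriptions or events is to tally the log lines. The sketch below is only an illustration: it assumes each log line contains the text in the "Retrying message {subscriptionId} {stream}/{eventNumber}" shape quoted above, and that neither the subscription id nor the stream name contains spaces. The function name count_retries and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape: "Retrying message {subscriptionId} {stream}/{eventNumber}"
# (ids without whitespace are an assumption, not something the thread confirms).
RETRY_PATTERN = re.compile(r"Retrying message (\S+) (\S+)/(\d+)")


def count_retries(lines):
    """Tally retry log lines per subscription and per (stream, event number)."""
    by_subscription = Counter()
    by_event = Counter()
    for line in lines:
        match = RETRY_PATTERN.search(line)
        if match is None:
            continue  # not a retry line
        subscription_id, stream, event_number = match.groups()
        by_subscription[subscription_id] += 1
        by_event[(stream, int(event_number))] += 1
    return by_subscription, by_event


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "Retrying message sub-a orders-1/42",
        "Retrying message sub-a orders-1/42",
        "Retrying message sub-b invoices-7/3",
        "some unrelated log line",
    ]
    subscriptions, events = count_retries(sample)
    print(subscriptions.most_common())
    print(events.most_common(5))
```

If a handful of events account for most of the retries, that points at specific events repeatedly failing or timing out rather than at a general slowdown.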

Previous deploy was two weeks ago.
There is nothing in our application logs that suggests that anything else is going wrong.
No evidence of connections dropping.

Does anyone have any ideas what could be causing this?

steven.blair · September 7, 2020, 8:26am

Is the persistent subscription feeding a service that loads into a read model?
Pure guess here, but I suspect that whatever is consuming the events is either 1) failing or 2) taking longer than usual. The default retry is 10 seconds per event.

Mike_Boehm · September 7, 2020, 9:37am

They’ve all got timeouts of 30 seconds.
They’re generally pretty simple: Hydrate model from events, call method and persist.
Perhaps ES is taking too long to do one of the operations?
How can I tell whether ES is at or close to its limit?
I’ve got logging and statsd metrics around fetches (length of stream and time taken), but neither of them appears to indicate that ES is saturated (e.g. only 30 seconds of each minute is spent reading events).

steven.blair · September 7, 2020, 9:44am

So your flow is this:

Persist to model
Event gets delivered via persistent Subscription to endpoint
You rehydrate the same model, or a new model?

Inside the event subscriber, have you got any logging to indicate how long each event takes to process?

What’s the in-flight count set to on your persistent subscription?
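A minimal sketch of the kind of handler timing being asked about, in Python purely for illustration (the thread does not say what language the handlers are written in). The decorator name timed_handler, the logger name, and the 80% warning threshold are all hypothetical; the 30 second figure is the timeout mentioned earlier in the thread.

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("handler_timing")  # hypothetical logger name

# 30 s is the timeout quoted earlier in the thread; adjust to your subscription.
MESSAGE_TIMEOUT_SECONDS = 30.0


def timed_handler(timeout_seconds=MESSAGE_TIMEOUT_SECONDS, warn_ratio=0.8):
    """Log how long a handler takes; warn when it nears the message timeout."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(event, *args, **kwargs):
            start = time.monotonic()
            try:
                return handler(event, *args, **kwargs)
            finally:
                # Runs whether the handler returned or raised.
                elapsed = time.monotonic() - start
                near_timeout = elapsed >= timeout_seconds * warn_ratio
                level = logging.WARNING if near_timeout else logging.INFO
                logger.log(level, "%s took %.3fs", handler.__name__, elapsed)
        return wrapper
    return decorator


@timed_handler()
def handle_event(event):
    # Placeholder body: hydrate the model, call the method, persist.
    return event


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    handle_event({"type": "example"})
```

Durations that cluster near or above the timeout would suggest the retries come from slow handling rather than from outright failures.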

Mike_Boehm · September 7, 2020, 9:55am

It’s a different model.

There’s no logging in the handler

Buffer Size: 20
Check Point After: 2000
Extra Statistics: false
Live Buffer Size: 500
Max Checkpoint Count: 1000
Max Retry Count: 500
Message Timeout (ms): 30000
Min Checkpoint Count: 10
Consumer Strategy: RoundRobin
Read Batch Size: 10
Resolve Link tos: true
Start From Event: -1

steven.blair · September 7, 2020, 10:57am

Looks fine to me.
Could you take one of the events, manually post it in (via Postman, etc.), and check how long it takes?

Mike_Boehm · September 8, 2020, 10:16am

Do you mean create one of the events that a handler would be listening for?

steven.blair · September 8, 2020, 10:19am

Yeah. On our systems here, our handlers are sitting behind a REST interface, so we simply use Postman to post the event in.