Dealing with failed event processing

Andy_Longshaw · December 2, 2014, 5:40pm

Apologies if this has been covered elsewhere. I took a quick look round the group but couldn’t spot anything.

I wondered if anyone can give guidance on good patterns for handling a failure in event processing. Assuming there is a retry strategy in place for processing an event, if the retries also fail you seem to be faced with a choice of:

Stop processing any events until the problem is fixed. Could be a long pause if the fix is not straightforward.
Carry on processing and compensate for it downstream (potentially rebuilding any state by replaying events).

I realise that this is very context-dependent, based on the impact of the processing action (e.g. updating a read model vs sending emails), which is why I’m looking for patterns around this (or equivalent) to weigh up the forces at play. I know of various patterns from message processing, such as dead letter queues, so I wondered if there was an equivalent body of work around event sourcing, even if it is only for a specific area such as maintaining read models.

Thanks

Andy

Greg_Young1 · December 2, 2014, 10:17pm

Which subscription model are you using? With competing consumers there is a
parked message queue, which is likely exactly what you are looking for.

For projections, usually the best answer is *stop*. The alternative is to
continue and then do a manual replay etc. (if you want to do it in a generic way).
Compensation can be done, but it is projection-specific.
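
To make the two options concrete, here is a minimal sketch of a processor that retries an event and then either stops or parks it. This is plain Python, not the Event Store API; `Event`, `Processor`, `ProcessingStopped`, `on_failure` and `parked` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    number: int  # position of the event in the stream (assumed to increase)
    data: str


class ProcessingStopped(Exception):
    """Raised when the processor halts at an event it cannot handle."""


@dataclass
class Processor:
    handler: Callable[[Event], None]
    max_retries: int = 3
    on_failure: str = "stop"  # "stop" (halt) or "park" (set aside, carry on)
    checkpoint: int = -1      # number of the last event handled or parked
    parked: List[Event] = field(default_factory=list)

    def _try_handle(self, event: Event) -> bool:
        # Retry strategy: call the handler up to max_retries times.
        for _ in range(self.max_retries):
            try:
                self.handler(event)
                return True
            except Exception:
                continue
        return False

    def run(self, events: List[Event]) -> None:
        for event in events:
            if event.number <= self.checkpoint:
                continue  # already handled on an earlier run
            if not self._try_handle(event):
                if self.on_failure == "stop":
                    # The checkpoint is left untouched, so a later run
                    # resumes at the failing event once the fix is in.
                    raise ProcessingStopped(f"halted at event {event.number}")
                # Park the event for manual replay and keep going.
                self.parked.append(event)
            self.checkpoint = event.number
```

With `on_failure="stop"` nothing after the failing event is processed until it succeeds, which keeps a projection consistent. With `on_failure="park"` later events are still handled, and anything in `parked` has to be replayed or compensated for separately.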

Greg

Andy_Longshaw · December 3, 2014, 8:34pm

We are not using competing consumers currently. We use different, single processors for different styles of event (ones that send emails, ones that update the read model, etc.) so that we can take different approaches based on their different replay behaviours. For example, we can share an event marker on a single subscription between multiple email-generating event handlers since, for the cost of a brief delay, there is no real issue with re-sending the same email. Taking the same approach across multiple read models could potentially leave the system in an inconsistent state until the problem event had been processed by all the event handlers.
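
The difference between a shared marker and per-handler markers can be sketched roughly as follows. This is an illustrative model only: events are reduced to their numbers, and `run_shared` and `run_separate` are hypothetical helpers, not part of any client library.

```python
from typing import Callable, Dict, List

Handler = Callable[[int], None]


def run_shared(events: List[int], handlers: Dict[str, Handler], marker: int) -> int:
    """One marker for all handlers: it moves only when every handler succeeds."""
    for number in events:
        if number <= marker:
            continue
        try:
            for handle in handlers.values():
                handle(number)
        except Exception:
            # Stop here. Handlers that already ran for this event will
            # receive it again on the next run (e.g. a re-sent email).
            return marker
        marker = number
    return marker


def run_separate(
    events: List[int], handlers: Dict[str, Handler], markers: Dict[str, int]
) -> Dict[str, int]:
    """One marker per handler: a failure only stops the handler that failed."""
    for name, handle in handlers.items():
        for number in events:
            if number <= markers[name]:
                continue
            try:
                handle(number)
            except Exception:
                break  # this handler waits for a fix; the others carry on
            markers[name] = number
    return markers
```

In the shared case a handler that succeeded sees the problem event a second time after the restart, which is tolerable for emails but means read models have to be idempotent. In the separate case no handler sees a duplicate, at the cost of tracking one marker (and, in practice, possibly one subscription) per handler.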

In both cases, our current approach is to stop processing and await a fix or retry. Is there any good discussion of the pros and cons of designs that share subscriptions and markers, weighed against how you implement parallelism and make efficient use of resources like subscriptions for a given problem context?

Andy