Handling errors in denormalisers

Hi, we have some questions about options for dealing with errors in our denormalisers.

Currently we create projections that route the events each service’s denormalisers are interested in to that service, which then processes them through its denormalisers. When processing an event fails and an exception is thrown, we drop the connection whilst we investigate why the event processing failed. We do this deliberately: we don’t want to continue processing events whilst the state of the denormalisation may be unknown. The consequence is that any functionality in that service which depends on the denormalised data is now offline.
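For context, a minimal sketch of that stop-on-error behaviour in TypeScript. The event shape and the stop callback are placeholders for whatever client wrapper the service already uses, not the real Event Store client API:

```typescript
// Sketch only: event shape and the stop callback are placeholders for the service's
// own subscription wrapper, not the actual Event Store client API.
interface ResolvedEvent { streamId: string; eventType: string; data: unknown; }

type Handler = (e: ResolvedEvent) => Promise<void>;

// Wraps a denormaliser so the first failure stops the subscription: we would rather
// take the read model offline than keep writing to it while its state is unknown.
function stopOnError(denormalise: Handler, stopSubscription: () => Promise<void>): Handler {
  return async (event) => {
    try {
      await denormalise(event);
    } catch (err) {
      console.error('denormalisation failed; dropping connection for investigation', err);
      await stopSubscription();
    }
  };
}
```

Because the whole service shares one subscription wrapped this way, one bad event takes all of that service’s read models offline, which is exactly the problem described next.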

The issue is that with this approach a single denormalisation failure brings down a whole part of the system until the problem is resolved, which can have significant impact.

We have been considering options.

Option 1.

Allowing the denormalisers to partition the events. Rather than having a single subscriber that subscribes to the entire stream, each service would start many subscribers, each subscribing to part of the data. For example, instead of a projection that routes all the events the service is interested in into a single stream, it routes them into several streams, say one per office, and each subscriber subscribes to one office’s events. The advantage is that when an event fails to process it only takes that service’s functionality offline for a single office; all other offices still function.

This seems like a win.

However, this comes at a cost. Rather than 1 connection per service to Event Store, we now have n connections, where n = number of offices, and with x services that makes n * x connections in total.

At what point does the number of connections become an issue? If we had 1000 offices and 20 services (20,000 connections), is this a problem?

Obviously we can divide up the projections and subscriptions so that, rather than 1 per office, we put offices into buckets and then subscribe to the buckets, but what is the right number of buckets to choose to play nicely with Event Store?
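One way to keep the subscriber count bounded, sketched below under the assumption that each event carries an office id: hash the office into a fixed number of buckets and have the routing projection link events into the bucket streams. The bucket count, stream names and hash choice are illustrative, not a recommendation:

```typescript
// Sketch: route events into a fixed number of bucket streams instead of one stream
// per office, so each service runs BUCKET_COUNT subscribers rather than one per office.
// The bucket count, stream names and officeId field are assumptions for illustration.
const BUCKET_COUNT = 32;

// Simple stable string hash (FNV-1a) so the same office always lands in the same bucket.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

function bucketStreamFor(officeId: string, service: string): string {
  const bucket = fnv1a(officeId) % BUCKET_COUNT;
  return `${service}-bucket-${bucket}`;
}

// e.g. an event for "office-1234" would be linked into something like "billing-bucket-17",
// and the billing service starts 32 subscribers, one per bucket stream.
console.log(bucketStreamFor('office-1234', 'billing'));
```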

Option 2

Allowing the denormalisers to carry on when they can’t process an event.

In this case we log that there was a problem and carry on processing. This has the advantage that things don’t go down and, depending on the nature of the events, we can probably carry on processing most of them successfully. The downside is that we might compound problems if one event needs the previous event to have been processed successfully: most customers in an office could continue to be processed, but one customer might fail all their events after the first failure. Fixing this might also be an issue. Short term we could manually add in the denormalised data (i.e. update the db with a SQL script), but really we think we will need to emit a compensating event and handle it in the denormaliser in case the denormaliser needs to be run again. That requires a code change (to handle the new event) and publishing the compensating event.
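A hedged sketch of this “log and carry on” variant in TypeScript: the handler keeps going but parks each failed event via a caller-supplied function (for example appending it to a per-service “parked” stream or table), so the gap can be repaired later by a script or a compensating event. The shapes and names are assumptions for illustration, not an existing API:

```typescript
// Sketch: keep processing past failures, but park each failed event somewhere durable
// so the gap can be repaired later, either manually or via a compensating event.
interface ResolvedEvent { streamId: string; eventType: string; data: unknown; }
type Handler = (e: ResolvedEvent) => Promise<void>;

function continueOnError(
  denormalise: Handler,
  parkEvent: (e: ResolvedEvent, error: unknown) => Promise<void>,
): Handler {
  return async (event) => {
    try {
      await denormalise(event);
    } catch (err) {
      // Beware ordering: if a later event for the same customer depends on this one,
      // it will probably fail too, so the parked list is what shows the blast radius.
      console.error('denormalisation failed; parking event and continuing', err);
      await parkEvent(event, err);
    }
  };
}
```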

We would like to know a few things.

  1. Is this generally the way people solve these problems when using Event Store?

  2. Are there alternatives we have not considered?

  3. For option 1, what are the practical connection limits we should consider if we take this approach?

“However, this comes at a cost. Rather than 1 connection per service to Event Store, we now have n connections, where n = number of offices, and with x services that makes n * x connections in total.”

At this point I would look at using Atom feeds for subscriptions as opposed to TCP. You can benefit heavily from putting a reverse proxy such as nginx in front of Event Store.
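To illustrate the Atom option, a small TypeScript sketch of reading one feed page over HTTP through the proxy. The URL shape and media type follow Event Store’s Atom API as I recall it and should be checked against the docs; the proxy host is a placeholder:

```typescript
// Sketch: read a stream over HTTP via the Atom feed instead of holding a TCP subscription.
// URL shape and media type are as I understand Event Store's HTTP API (verify against
// the docs); the proxy host is a placeholder for an nginx front end.
const PROXY = 'http://eventstore-proxy.internal:2113';

async function readPage(stream: string, from: number, count: number): Promise<unknown> {
  const res = await fetch(`${PROXY}/streams/${stream}/${from}/forward/${count}`, {
    headers: { Accept: 'application/vnd.eventstore.atom+json' },
  });
  if (!res.ok) throw new Error(`feed request failed: ${res.status}`);
  return res.json();
}

// Completed feed pages never change, so the reverse proxy can cache them and serve
// replays to many subscribers without the requests ever reaching Event Store.
readPage('billing-bucket-17', 0, 20).then((page) => console.log(page));
```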

For failed projections there are generally two options:

  1. Carry on, but let people know the data may be incorrect (this is pretty easy to identify). Then fix the projection and replay it. Usually you can also figure out in code the scope of what data may be incorrect (e.g. limited to a single customer); see the sketch further down.

  2. Stop.

Most people prefer #1, especially if the data is intended for human use. If it is being used for automation purposes, many prefer #2.
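To make #1 concrete, a minimal sketch of flagging the scope of suspect data in the read model, assuming a hypothetical SuspectStore (the interface and names are illustrative): when a customer’s event fails, mark that customer as suspect so consumers can be warned, then clear the flag once the projection is fixed and replayed.

```typescript
// Sketch: record the scope of suspect data so consumers of the read model can be warned.
// The store interface and names are assumptions; in practice this might be a column on
// the denormalised table or a separate "suspect" list.
interface SuspectStore {
  markSuspect(customerId: string, reason: string): Promise<void>;
  clearSuspect(customerId: string): Promise<void>;
}

async function handleFailure(store: SuspectStore, customerId: string, err: unknown) {
  // Mark only the affected customer: that is the scope we know may be wrong,
  // while everything else in the office keeps working and stays trusted.
  await store.markSuspect(customerId, String(err));
}

async function afterReplay(store: SuspectStore, customerId: string) {
  // Once the projection is fixed and replayed for this customer, lift the warning.
  await store.clearSuspect(customerId);
}
```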

Can you explain the benefit of a reverse proxy in this situation?

With 20k subscribers you can use HTTP caching to your advantage, reducing load on Event Store and, if consumers are geographically distributed, lowering latency for them on replays etc.

ok thanks