Hi,
AFAIK, when the leader node changes, the persistent subscription re-delivers events from the last checkpoint, which means some events may be duplicated.
We currently write a new checkpoint every 2 seconds to minimise the chance of receiving duplicate events when the leader node changes. I suspect this causes poor performance during scavenges and in general operation.
I’m considering adding a new stream to store the last acknowledged position, or using a Redis cache for it.
Is there a better way to have the persistent subscription start from the last acknowledged position instead of the last checkpoint?
You would indeed need to keep the checkpoint yourself and compare it with each received event once resubscribed. A few caveats, though:
- Acking an event in the persistent subscription and storing it as a checkpoint yourself will not be transactional.
- So store before you ack (store = process the event & store your checkpoint).
- If the received event's revision or position is lower than or equal to your checkpoint, simply do not pass it to the code handling the event.
- This strategy will cause problems if some events need to be resent from the parked events stream.
- It will also cause problems because persistent subscriptions do not guarantee delivering events in order (in-order delivery is best effort).
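A minimal sketch of the store-before-ack pattern above. This is not the real EventStoreDB client API; the handler, checkpoint store, and event shape are placeholders, and the caller is assumed to ack only after this wrapper returns True:

```python
class CheckpointingHandler:
    """Wraps an event handler, skipping events at or below the stored checkpoint.

    Hypothetical sketch: load_checkpoint/save_checkpoint stand in for your own
    persistence (a stream, Redis, etc.) and are not part of any client library.
    """

    def __init__(self, handler, load_checkpoint, save_checkpoint):
        self._handler = handler
        self._save = save_checkpoint
        # Last processed revision, or -1 if we have never processed anything.
        self._checkpoint = load_checkpoint()

    def on_event(self, revision, event):
        """Return True if processed, False if skipped as a duplicate."""
        if revision <= self._checkpoint:
            return False              # already seen: drop before the handler runs
        self._handler(event)          # 1. process the event
        self._save(revision)          # 2. store the checkpoint...
        self._checkpoint = revision
        return True                   # 3. ...and only then let the caller ack
```

Because the checkpoint is saved before the ack, a crash between the two can only cause a redelivery (filtered out by the revision check), never a lost event. Note this does not address the parked-stream or out-of-order caveats above.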
Can you explain what you’re doing in the subscription?
Depending on that, there might be different strategies to use for checkpointing and/or idempotent processing.
This could help as well (links to our community Discord).
I have been thinking about this problem for our own use cases. My conclusion is that it is probably better to manage the checkpoint in our code (using e.g. JGroups to sync between concurrent workers), or better still, not to store a checkpoint at all.
The idea is that, whenever possible, the workers translate the events into calls to some other API. If that API is itself transactional, then on startup/recovery the workers can ask it for the latest event id/revision it has seen and resume from there. This seems like the best way to get a performant and correct resume, but it relies on the API being integrated with exposing that information. In our case the API being integrated with is either another EventStore or a named projection, which stores a name+revision for each update.
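A sketch of that startup/recovery step, under the assumption that the downstream system can report the last revision it applied. The `last_seen_revision` method and the target object are invented names for illustration, not a real API:

```python
def resume_position(target):
    """Ask the downstream (transactional) target where to resume from.

    Hypothetical: `target.last_seen_revision()` stands in for whatever the
    integrated system exposes, e.g. the revision stored alongside a named
    projection. Returns the first revision that still needs processing.
    """
    last = target.last_seen_revision()
    return 0 if last is None else last + 1
```

With this, the worker keeps no checkpoint of its own: the downstream write and the "checkpoint" are the same transactional fact, so there is no window where they can disagree.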