Ah I understand, thanks for clearing that up. Unfortunately in this case the lastProcessedVersion + 1 check is the only thing preventing an inconsistent data model (because of skipped events).
Is this enough information for further investigation or do I need to test something in particular?
```
var receivedVersion = resolvedEvent.Link.EventNumber;
if (receivedVersion < expectedVersion)
{
    Logger.Warn("VolatileProjectionModule \"{0}\": Skipping already processed event with version {1}, expected {2}",
        _contextKey, receivedVersion, expectedVersion);
    return;
}
if (receivedVersion > expectedVersion)
    throw new EventMissedException(expectedVersion, receivedVersion);
```
I already use the linkTo event number, since the projection in use covers the whole bounded context stream, so only the bounded context id is relevant for us.
We are catching this EventMissedException and re-creating the subscription, which “mostly” works fine, but it’s a pretty heavy workaround and occasionally leads to new bugs that would not have occurred otherwise.
What I find noteworthy: every time these missed events happen, the first “wrong” call to EventAppeared is also made on a different thread.
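For anyone following along, the check described above can be sketched independently of the client library. This is only a minimal illustration of the `lastProcessedVersion + 1` rule; `VersionGate`, `GateResult`, `Classify` and `MarkProcessed` are hypothetical names and not part of our actual code or of the Event Store client API.

```csharp
using System;

// Hypothetical names for illustration; not part of the Event Store client API.
public enum GateResult { Process, SkipDuplicate, Gap }

public sealed class VersionGate
{
    private long _lastProcessedVersion;

    public VersionGate(long lastProcessedVersion)
    {
        _lastProcessedVersion = lastProcessedVersion;
    }

    public long LastProcessedVersion
    {
        get { return _lastProcessedVersion; }
    }

    // Compares the received event number with lastProcessedVersion + 1.
    // Does not change any state.
    public GateResult Classify(long receivedVersion)
    {
        long expectedVersion = _lastProcessedVersion + 1;
        if (receivedVersion < expectedVersion) return GateResult.SkipDuplicate;
        if (receivedVersion > expectedVersion) return GateResult.Gap;
        return GateResult.Process;
    }

    // Call only after the event has been processed successfully.
    public void MarkProcessed(long processedVersion)
    {
        if (processedVersion != _lastProcessedVersion + 1)
            throw new InvalidOperationException("Events must be processed in order.");
        _lastProcessedVersion = processedVersion;
    }
}
```

In this sketch a `Gap` result corresponds to the point where we throw EventMissedException and re-create the subscription, and `SkipDuplicate` corresponds to the logged early return.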
These appear in three different Azure environments (Production, Staging, Dev) in several similar but not identical components (Projection processing, “Workflow”/EventHandler processing and generic subscriptions).
As this has happened quite a lot over the last few days, I’d like to remind you that this is still a critical issue for us.
The following scenario occurred a few days ago (numbers X-ed for clarity):
EventAppeared invoked for eventNumber XXXX718, Thread ID 12, entered lock.
… processing begins for XXXX718
EventAppeared invoked for eventNumber XXXX728, Thread ID 37 (!), waits because lock is still being held by processing of XXXX718
… processing finished for XXXX718
EventAppeared invoked for eventNumber XXXX719, Thread ID 12, waits because lock was entered by the invalid invocation of XXXX728 after XXXX718 finished
… processing begins for XXXX728 which throws an exception because XXXX728 is not the next event number after XXXX718 (obviously).
These invocations came from the same subscription (checked via hash code) and were only milliseconds apart, so there was no timeout or similar that could have triggered another invocation.
So… the same question remains: why does the subscription invoke a seemingly random event number on a different thread while the previous event is still being processed?
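To capture this outside of our own processing code, something like the following recorder could be called at the very top of the EventAppeared callback, before any lock is taken. It is only a sketch: `InvocationRecorder` and its members are hypothetical names, it does not touch the Event Store client API, and it assumes the subscription is expected to deliver consecutive event numbers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical diagnostic helper; not part of the Event Store client API.
public sealed class InvocationRecorder
{
    private readonly object _sync = new object();
    private readonly List<string> _anomalies = new List<string>();
    private long? _previousEventNumber;
    private int _previousThreadId;

    // Snapshot of all out-of-order invocations seen so far.
    public IReadOnlyList<string> Anomalies
    {
        get { lock (_sync) { return _anomalies.ToArray(); } }
    }

    // Records an invocation using the calling thread's managed thread id.
    public void Record(long eventNumber)
    {
        Record(eventNumber, Environment.CurrentManagedThreadId);
    }

    // Records an invocation with an explicit thread id (useful for replaying a log).
    public void Record(long eventNumber, int threadId)
    {
        lock (_sync)
        {
            if (_previousEventNumber.HasValue && eventNumber != _previousEventNumber.Value + 1)
            {
                _anomalies.Add(string.Format(
                    "Expected {0} but got {1} (thread {2}, previous thread {3})",
                    _previousEventNumber.Value + 1, eventNumber, threadId, _previousThreadId));
            }
            _previousEventNumber = eventNumber;
            _previousThreadId = threadId;
        }
    }
}
```

Replaying the scenario above with `Record(718, 12)`, `Record(728, 37)` and `Record(719, 12)` yields two anomalies: the jump from 718 to 728 on thread 37, and the late arrival of 719 on thread 12.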
As mentioned previously the best way of getting people to spend time
looking at this issue (that only you appear to be having, which leads
me to believe it’s not a mainline case) is to provide a test that does
not use any of your code.
As previously stated two emails ago:
"I have not been able to reproduce this running against dev on linux."
Without being able to reproduce there is zero chance of anyone fixing
your issue.