Subscription "skips" events

Hi,
for weeks, I occasionally notice a weird problem with EventStore’s subscriptions.

In rare conditions, some events are just skipped. In one case, which happened a few minutes ago, after successfully processing event #3, EventAppeared for event #17 was invoked.

In the last code line of my EventAppeared handler I write the processed event number in a variable and check in subsequent EventAppeared-invocations if the received event number was incremented by 1.

The problem occurs under Event Store 3.0 stable, 3.0.1 stable and various dev+RC variants before 3.0 got stable.

In nearly or maybe all occurences, this happened when there are a lot of EventAppeared invocations … in the case of a few minutes ago, the subscription started fresh and had to process 500 events. “a lot” being relative here, but other than those 500-event batches there are mostly just a dozen events per minute or so.

Based on logging, I can rule out parallel access to my “last processed event number” variable.

Regards,

Nicolas

What kind of subscription? Catch up or just a subscription?

Hi Greg,
We only use catch up subscriptions

Can you provide something that can repeat it (even occasionally)?

I modified a few of my core components to be more loosely coupled and created a few unit tests that should be relatively near production usage (just with in-memory event store etc.) but were unable to reproduce it as of yet.
As far as I know, this only happened in Azure-hosted production deployments, not in local debug builds, so tomorrow I will try and test it inside azure.

Definitely need repro code to be able to come back with anything meaningful here. Can you post:

  • the construction of the catch up subscription

  • whatever is checkpointing your position

Can you also look (in the logs) to see if this happens after heartbeat timeouts?

James

After modifying a few components for easier testing and writing integration tests to test this particular problem, I got really weird results so I am still in the process of investigating a few things.

Seems like it is related to the Redis library that I use (StackExchange.Redis) from inside the EventAppeared handler. On rare occasions, due to a bug in that library, it is possible that the TPL hijacks a dedicated socket reader thread of said library, which “somehow” leads to situations where two threads concurrently execute EventAppeared and thus the counter for the last processed event number goes haywire and … basically … everything goes to hell.

To rephrase that, the same subscription of GES is invoking my EventAppeared handler from two different threads.

Though it would be nice if GES would somehow protect against this, I doubt that this a use case that should occur under normal situations and is probably only possible if the TPL itself messes things up.

For the time being, I switched to not using the async methods of said library, which does seem to stop the TPL thread hijacking and thus the disturbing of GES’s invocation of EventAppeared.

While the mentioned Redis library did create weird situations related to threading, it seems like it wasn’t the culprit for the GES-related problems.

I created the smallest possible subset of my source code including unit tests and after submitting this post will send you two each a copy of it.

To further explain what seems to be happening:

Information: [19:31:01:545][INFO] VolatileProjectionModule “test4”: Handled event of type BoundedContextStartedEvent with version 24, used subscription: 57243299, Thread ID: 55

Information: [19:31:01:571][INFO] VolatileProjectionModule “test4”: EventAppeared with eventNumber: 25, Thread ID: 55

Verbose: [19:31:01:611][VERBOSE] VolatileProjectionModule “test4”: Event appeared of type “BoundedContextStartedEvent”, received: 25, expected: 25, used subscription: 57243299, Thread ID: 55

Verbose: [19:31:01:667][VERBOSE] VolatileProjectionModule “test4”: Handling event of type BoundedContextStartedEvent with version 25, used subscription: 57243299

Information: [19:31:01:728][INFO] VolatileProjectionModule “test4”: Handled event of type BoundedContextStartedEvent with version 25, used subscription: 57243299, Thread ID: 55

Information: [19:31:01:798][INFO] VolatileProjectionModule “test4”: EventAppeared with eventNumber: 26, Thread ID: 55

Information: [19:31:01:832][INFO] VolatileProjectionModule “test4”: EventAppeared with eventNumber: 74, Thread ID: 54

Verbose: [19:31:01:857][VERBOSE] VolatileProjectionModule “test4”: Event appeared of type “BoundedContextStartedEvent”, received: 26, expected: 26, used subscription: 57243299, Thread ID: 55

Verbose: [19:31:01:925][VERBOSE] VolatileProjectionModule “test4”: Handling event of type BoundedContextStartedEvent with version 26, used subscription: 57243299

Information: [19:31:01:942][INFO] VolatileProjectionModule “test4”: Handled event of type BoundedContextStartedEvent with version 26, used subscription: 57243299, Thread ID: 55

Information: [19:31:01:959][INFO] VolatileProjectionModule “test4”: EventAppeared with eventNumber: 27, Thread ID: 55

Verbose: [19:31:01:960][VERBOSE] VolatileProjectionModule “test4”: Event appeared of type “BoundedContextStartedEvent”, received: 74, expected: 27, used subscription: 57243299, Thread ID: 54

``

Each processed event will result in 4 log lines, as is visible from the implementation: https://gist.github.com/nmehlei/71c63b103448562bbb74

The one with “EventAppeared with eventNumber” is logged right from the first line of my EventAppeared handler. Directly after it is a lock block. From inside that lock block, the three additional log lines are written in different stages of the event processing.

Since the subscription and used thread are visible from the log lines, one can see that the same subscription invokes the EventAppeared handler for eventNumber 26 and 74 concurrently over two different threads (55 and 54).

I don’t think this should happen.

It shouldn’t will look

not trying to bug you two, but did one of you find the time to take a look? Is the example project understandable?

Thanks,

Nicolas

I am more trying to figure out how you are getting 74 interleaved with 24/25/26

Is there any way we can further help with the finding of this?
Was it reproducible with the example project?

We have two more issues that might be related.

  1. Relatively frequent reconnects. As far as I can see, sometimes a few hundred times a day. At first I thought it might be related to the inherent nature of the Cloud, but since it also occurs locally, that can be ruled out.

  2. Sometimes, but very rare, Subscriptions just stop invoking its event handler. Possibly only after reconnects (just a theory), but there are no log entries that shed further light on this, no SubscriptionDropped invocations etc.

Both of them occur with the example project, too.

I will try to run it but in this thread I am only seeing this:
https://gist.github.com/nmehlei/71c63b103448562bbb74

Is the full example outside of the thread?

Exactly, I sent you and James the full example via Mail (November 11).

Let me look back at it. I will get it running locally.

what do logs say on reconnect (client and server)

Also what is the drop reason/message/exception when you see drops?

I was not able to diagnose this any further in the last few weeks, but the problem still persists.

I don’t get drop’s, except those that I initiate myself (because of the mentioned skips). (Sidenote: Maybe the ClientAPI should have this check integrated? There should not be a reason why the version of a processed event is not exactly lastProcessedVersion + 1).

Looking at the logs, I did find the following combination which seems odd:

13:30:30.097 Fri Feb 20 12:30:30.0644 +00:00 2015 ProjectionHost (MyHostname) : Info, PersistentProjectionModule “forum”: EventAppeared with eventNumber: 1, Thread ID: 46

13:30:30.097 Fri Feb 20 12:30:30.0644 +00:00 2015 ProjectionHost (MyHostname) : Trace, PersistentProjectionModule “forum”: Event appeared of type “ForumCreatedEvent”, received: 1, expected: 1, used subscription: 7340066

13:30:30.097 Fri Feb 20 12:30:30.0644 +00:00 2015 ProjectionHost (MyHostname) : Trace, PersistentProjectionModule “forum”: Handling event of type ForumCreatedEvent with version 1, used subscription: 7340066

13:30:30.142 Fri Feb 20 12:30:30.1241 +00:00 2015 ProjectionHost (MyHostname) : Trace, PersistentProjectionModule “forum”: latency was 1.4173962 s.

13:30:30.142 Fri Feb 20 12:30:30.1241 +00:00 2015 ProjectionHost (MyHostname) : Info, PersistentProjectionModule “forum”: Handled event of type ForumCreatedEvent with version 1, used subscription: 7340066, Thread ID: 46

13:30:30.142 Fri Feb 20 12:30:30.1241 +00:00 2015 ProjectionHost (MyHostname) : Trace, Catch-up Subscription to context-forum: processed event (context-forum, 1, $> @ 1).

// Looks good, 2nd event (=version 1) appeared, was handled/processed, everything went fine

13:30:30.142 Fri Feb 20 12:30:30.1241 +00:00 2015 ProjectionHost (MyHostname) : Trace, Catch-up Subscription to context-forum: finished reading events, nextReadEventNumber = 2.

13:30:30.142 Fri Feb 20 12:30:30.1241 +00:00 2015 ProjectionHost (MyHostname) : Trace, Catch-up Subscription to context-forum: subscribing…

// not sure why it says subscribing (should already be subscribed?) but nextReadEventNumber = 2 looks okay, version 2 should be the next appearing version

13:30:31.522 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Trace, Subscription {02dec189-c893-49cd-9676-bd4fd9aa4f22} to context-forum: subscribed at CommitPosition: 425148, EventNumber: 3.

// EventNumber 3 looks suspicious, as version 2 should be next

13:30:31.522 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Trace, Subscription {02dec189-c893-49cd-9676-bd4fd9aa4f22} to context-forum: event appeared (context-forum, 4, $> @ 427009/427009).

13:30:31.522 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Trace, Subscription {02dec189-c893-49cd-9676-bd4fd9aa4f22} to context-forum: event appeared (context-forum, 5, $> @ 430400/430400).

13:30:31.522 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Trace, Subscription {02dec189-c893-49cd-9676-bd4fd9aa4f22} to context-forum: event appeared (context-forum, 6, $> @ 430637/430637).

// Now it’s definitely going wrong, two and three were not processed and it received 4

13:30:31.534 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Trace, Catch-up Subscription to context-forum: event appeared (context-forum, 4, $> @ 427009/427009).

13:30:31.534 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Trace, Catch-up Subscription to context-forum: event appeared (context-forum, 5, $> @ 430400/430400).

13:30:31.535 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Trace, Catch-up Subscription to context-forum: event appeared (context-forum, 6, $> @ 430637/430637).

13:30:31.535 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Info, PersistentProjectionModule “forum”: EventAppeared with eventNumber: 4, Thread ID: 13

13:30:31.535 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Trace, PersistentProjectionModule “forum”: Event appeared of type “ForumCreatedEvent”, received: 4, expected: 2, used subscription: 7340066

13:30:31.535 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Error, PersistentProjectionModule “forum”: Error during projection processing, missed version, received: 4, expected: 2, previous: 1 FatalErrorCritical

// Here my app aborted processing, as it says in this line, the previous processed event was version 1 (correct), it expected version 2 (correct), it received version 4 (bad).
// Since this should never happen and processing can’t correctly occur with missing events, I throw an exception here.

13:30:31.535 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Trace, Catch-up Subscription to context-forum: pulling events (if left)…

13:30:31.550 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Trace, EventStoreConnection ‘ES-6cf04c4a-a197-457f-b4ff-ee60927331bf’: StartOperation schedule ReadStreamEventsForwardOperation, Stream: context-forum, FromEventNumber: 2, MaxCount: 50, ResolveLinkTos: True, RequireMaster: True, -1, 00:00:07…

13:30:31.550 Fri Feb 20 12:30:31.4980 +00:00 2015 ProjectionHost (MyHostname) : Trace, EventStoreConnection ‘ES-6cf04c4a-a197-457f-b4ff-ee60927331bf’: ScheduleOperation package ReadStreamEventsForward, dbec4268-5f4b-47a9-9802-173ba9db6eae, Operation ReadStreamEventsForwardOperation ({dbec4268-5f4b-47a9-9802-173ba9db6eae}): Stream: context-forum, FromEventNumber: 2, MaxCount: 50, ResolveLinkTos: True, RequireMaster: True, retry count: 0, created: 12:30:31.498, last updated: 12:30:31.498…

13:30:31.581 Fri Feb 20 12:30:31.5460 +00:00 2015 ProjectionHost (MyHostname) : Trace, EventStoreConnection ‘ES-6cf04c4a-a197-457f-b4ff-ee60927331bf’: HandleTcpPackage OPERATION DECISION EndOperation (Success), Operation ReadStreamEventsForwardOperation ({dbec4268-5f4b-47a9-9802-173ba9db6eae}): Stream: context-forum, FromEventNumber: 2, MaxCount: 50, ResolveLinkTos: True, RequireMaster: True, retry count: 0, created: 12:30:31.498, last updated: 12:30:31.498.

13:30:31.581 Fri Feb 20 12:30:31.5460 +00:00 2015 ProjectionHost (MyHostname) : Trace, EventStoreConnection ‘ES-6cf04c4a-a197-457f-b4ff-ee60927331bf’: RemoveOperation SUCCEEDED for Operation ReadStreamEventsForwardOperation ({dbec4268-5f4b-47a9-9802-173ba9db6eae}): Stream: context-forum, FromEventNumber: 2, MaxCount: 50, ResolveLinkTos: True, RequireMaster: True, retry count: 0, created: 12:30:31.498, last updated: 12:30:31.498.

// I am not exactly sure what those lines mean, but I guess the exception killed the previous subscription which lead to a re-subscription with the last processed version (2), so everything is more or less working fine.

13:30:31.581 Fri Feb 20 12:30:31.5460 +00:00 2015 ProjectionHost (MyHostname) : Info, PersistentProjectionModule “forum”: EventAppeared with eventNumber: 2, Thread ID: 46

13:30:31.581 Fri Feb 20 12:30:31.5460 +00:00 2015 ProjectionHost (MyHostname) : Trace, PersistentProjectionModule “forum”: Event appeared of type “ForumCreatedEvent”, received: 2, expected: 2, used subscription: 7340066

13:30:31.581 Fri Feb 20 12:30:31.5460 +00:00 2015 ProjectionHost (MyHostname) : Trace, PersistentProjectionModule “forum”: Handling event of type ForumCreatedEvent with version 2, used subscription: 7340066

13:30:31.620 Fri Feb 20 12:30:31.5958 +00:00 2015 ProjectionHost (MyHostname) : Trace, PersistentProjectionModule “forum”: latency was 1.8884917 s.

13:30:31.620 Fri Feb 20 12:30:31.5958 +00:00 2015 ProjectionHost (MyHostname) : Info, PersistentProjectionModule “forum”: Handled event of type ForumCreatedEvent with version 2, used subscription: 7340066, Thread ID: 46

13:30:31.620 Fri Feb 20 12:30:31.5958 +00:00 2015 ProjectionHost (MyHostname) : Trace, Catch-up Subscription to context-forum: processed event (context-forum, 2, $> @ 2).

13:30:31.620 Fri Feb 20 12:30:31.5958 +00:00 2015 ProjectionHost (MyHostname) : Info, PersistentProjectionModule “forum”: EventAppeared with eventNumber: 3, Thread ID: 46

// after re-subscription, version 2 appeared and (as it was the expected version), everything worked fine.

``

Log line 7 and everything afterwards should be relevant to this case. What would be the next steps to diagnose this further?

Regards

"There should not be a reason why the version of a processed event is
not exactly lastProcessedVersion + 1)."

There are reasons why the version of a processed event is not exactly
lastProcessedVersion + 1. TruncateBefore and maxage/maxcount can cause
this in conjunction with scavenging. They are not normal use cases but
there are edge conditions that can cause it :slight_smile:

Cheers,

Greg