I’ve experienced the above symptoms twice now in production, where a catch-up subscription (operating in live mode) appears to have silently dropped some messages. This is with the Java client (EventStore.JVM), and a bug has been filed there (see https://github.com/EventStore/EventStore.JVM/issues/62), but no feedback has been provided yet.
So far I have been unable to reproduce this in isolation, but messages have undeniably been lost at some point in my processing pipeline. The conditions that appear to trigger this are a combination of an incoming subscription plus a lot of reads from EventStore (one full read of a stream per incoming message from the subscription). During peak times, processing of incoming messages is slower than the incoming message rate. It appears that the incoming events, each triggering lots of reads from EventStore, starve the connection: I see a missed heartbeat, followed by a reconnect, followed by missed events.
I am running against a 3-node cluster of EventStore 3.7.0, with version 2.2.2 of the Java client.
My questions for the group are: 1) Has anyone else seen this? 2) Under what circumstances could a catch-up subscription (in live mode) possibly drop messages? 3) What are the expected semantics of the client after a reconnect? I assume it reads from the stream to fetch the missing events and then subscribes live again.
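To make that last assumption concrete, here is a rough sketch of the behaviour I would expect after a reconnect. It is not EventStore.JVM code: `CatchUpSketch`, `StreamReader` and `Checkpoint` are hypothetical names made up for illustration, and it assumes a single stream whose event numbers are contiguous and start at 0.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these types exist in EventStore.JVM. */
public class CatchUpSketch {

    /** Stand-in for reading a stream forward, starting at a given event number. */
    interface StreamReader {
        List<Long> readFrom(long fromEventNumber);
    }

    /** Tracks the last processed event number so gaps can be detected and filled. */
    static final class Checkpoint {
        private long lastProcessed = -1; // nothing processed yet
        private final List<Long> processed = new ArrayList<>();

        /**
         * Handles an event pushed by the live subscription.
         * Returns false if the event number skips ahead (a gap), true otherwise.
         */
        boolean onLiveEvent(long eventNumber) {
            if (eventNumber <= lastProcessed) {
                return true; // duplicate of something already handled, ignore it
            }
            if (eventNumber != lastProcessed + 1) {
                return false; // events in between are missing
            }
            processed.add(eventNumber);
            lastProcessed = eventNumber;
            return true;
        }

        /** Reads everything after the checkpoint, e.g. after a reconnect or a detected gap. */
        void catchUp(StreamReader reader) {
            for (long eventNumber : reader.readFrom(lastProcessed + 1)) {
                onLiveEvent(eventNumber);
            }
        }

        long lastProcessed() {
            return lastProcessed;
        }

        List<Long> processed() {
            return processed;
        }
    }

    public static void main(String[] args) {
        // In-memory stream holding events 0..9 (assumption: contiguous numbering).
        StreamReader reader = from -> {
            List<Long> events = new ArrayList<>();
            for (long n = from; n < 10; n++) {
                events.add(n);
            }
            return events;
        };

        Checkpoint checkpoint = new Checkpoint();
        checkpoint.onLiveEvent(0);
        checkpoint.onLiveEvent(1);
        checkpoint.onLiveEvent(2);

        // Connection drops here; after the reconnect the first live event is 6.
        if (!checkpoint.onLiveEvent(6)) {
            checkpoint.catchUp(reader); // read the missed events before going live again
        }
        System.out.println(checkpoint.processed());
    }
}
```

If the client does not do something equivalent internally, a checkpoint like this on the consumer side would at least turn a silent drop into a detectable gap.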
Cheers,
Kristian