Events dropped in catch-up subscription (after going live)

Kristian_Freed · September 3, 2016, 1:25pm

I’ve experienced the above symptoms twice now in production, where a catch-up subscription (operating in live mode) appears to have silently dropped some messages. This is using the Java client (EventStore.JVM), and a bug has been filed there (see https://github.com/EventStore/EventStore.JVM/issues/62), but no feedback has been provided.

So far I have been unable to reproduce this in isolation, but messages have undeniably been lost at some point in my processing pipeline. The conditions that appear to trigger this are a combination of an incoming subscription plus a lot of reads from EventStore (one full read of a stream per incoming message from the subscription). During peak times, incoming messages are processed more slowly than they arrive. It appears the combination of incoming events triggering lots of reads from EventStore is starving the connection and I see a missed heartbeat followed by reconnect, followed by missed events.

Running against a 3 node cluster of EventStore 3.7.0, with the 2.2.2 version of the Java client.

My questions for the group are: 1) Has anyone else seen this? 2) Under what circumstances could a catch-up subscription (in live mode) possibly drop messages? 3) What are the expected semantics of the client after a reconnect? I assume it reads from the stream to get the missing events and then subscribes live again.

Cheers,

Kristian

Greg_Young1 · September 3, 2016, 1:38pm

Catch-up subscriptions by definition should not drop messages.

"It appears the combination of incoming events triggering lots of
reads from EventStore is starving the connection and I see a missed
heartbeat followed by reconnect, followed by missed events."

This could possibly be an issue in the JVM subscription logic.

Also, if you have a missed heartbeat, it's not the connection being
starved. Heartbeats are only sent when there are no other messages
within a given time period.
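
To make the point about heartbeats concrete, here is a rough sketch of an idle-based heartbeat check. It is not taken from EventStore or EventStore.JVM: the class name HeartbeatSketch, the Action enum, and the interval/timeout parameters are all hypothetical, and the assumption that any inbound message counts as proof of life is mine. It only illustrates why a busy connection would normally never reach the "no heartbeat" state.

```java
// Hypothetical sketch, not EventStore code: all names and timings are illustrative.
final class HeartbeatSketch {
    enum Action { NONE, SEND_HEARTBEAT, CLOSE_CONNECTION }

    private final long intervalMillis;  // idle time before a heartbeat is sent
    private final long timeoutMillis;   // how long to wait for any reply to it
    private long lastMessageAt;         // time of the last inbound message
    private long heartbeatSentAt = -1;  // -1 means no heartbeat is outstanding

    HeartbeatSketch(long intervalMillis, long timeoutMillis, long now) {
        this.intervalMillis = intervalMillis;
        this.timeoutMillis = timeoutMillis;
        this.lastMessageAt = now;
    }

    // Assumption: any inbound message (not just a heartbeat reply) shows the peer is alive.
    void onMessageReceived(long now) {
        lastMessageAt = now;
        heartbeatSentAt = -1;
    }

    // Called periodically by a timer with the current time in milliseconds.
    Action onTick(long now) {
        if (heartbeatSentAt >= 0) {
            // A heartbeat is outstanding: give up only once the timeout has elapsed.
            return now - heartbeatSentAt >= timeoutMillis
                    ? Action.CLOSE_CONNECTION
                    : Action.NONE;
        }
        if (now - lastMessageAt >= intervalMillis) {
            // The connection has been idle for a full interval: probe it.
            heartbeatSentAt = now;
            return Action.SEND_HEARTBEAT;
        }
        return Action.NONE;
    }
}
```

Under this model, a connection that is receiving traffic keeps resetting the idle timer, so a heartbeat timeout suggests that nothing at all arrived (or was processed) for the whole interval plus the timeout.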

Kristian_Freed · September 3, 2016, 1:47pm

What I’m seeing is:

[default-akka.actor.default-dispatcher-3] - eventstore.tcp.ConnectionActor - Connection lost to /10.4.0.4:1113: no heartbeat within 2 seconds

I am not sure what the exact conditions for this being logged are; the code location is here: https://github.com/EventStore/EventStore.JVM/blob/master/src/main/scala/eventstore/tcp/ConnectionActor.scala#L159

For my understanding, is the expected behavior of a client on reconnect as I described, i.e.: subscribe to live updates, read any old messages that may have been missed, and, after all old messages have been processed, feed through the live ones?
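
The sequence described above could be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the EventStore.JVM implementation: CatchUpSketch, Event, and every method name are hypothetical, events are identified by their event number within a single stream, and the actual subscribing and reading is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not the EventStore.JVM API: all names are illustrative.
final class CatchUpSketch {
    // Minimal event: its number within the stream plus a payload.
    static final class Event {
        final long number;
        final String data;
        Event(long number, String data) { this.number = number; this.data = data; }
    }

    private final Consumer<Event> handler;
    private final List<Event> liveBuffer = new ArrayList<>();
    private long lastProcessed;          // number of the last delivered event, -1 if none
    private boolean catchingUp = true;   // true until the historical read has finished

    CatchUpSketch(long lastProcessed, Consumer<Event> handler) {
        this.lastProcessed = lastProcessed;
        this.handler = handler;
    }

    // After a reconnect: return to catch-up mode and report where to read from.
    long onReconnect() {
        catchingUp = true;
        liveBuffer.clear();
        return lastProcessed + 1;
    }

    // Events pushed by the live subscription are held back while catching up.
    void onLiveEvent(Event e) {
        if (catchingUp) liveBuffer.add(e); else deliver(e);
    }

    // Events read from the stream, starting after the last processed one.
    void onHistoricalBatch(List<Event> batch) {
        for (Event e : batch) deliver(e);
    }

    // The historical read reached the end of the stream: drain the buffer, go live.
    void onCaughtUp() {
        for (Event e : liveBuffer) deliver(e);
        liveBuffer.clear();
        catchingUp = false;
    }

    // Skip anything at or before the last processed number, so events that appear
    // in both the historical read and the live buffer are delivered only once.
    private void deliver(Event e) {
        if (e.number <= lastProcessed) return;
        handler.accept(e);
        lastProcessed = e.number;
    }
}
```

In this sketch, the live subscription is opened before the historical read starts, so the two overlap rather than leave a gap; the check on the event number in deliver is what removes the resulting duplicates.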

Cheers,

Kristian

Yaroslav_Klymko · September 3, 2016, 3:36pm

Hi Kristian, thanks for reporting, I’m looking into it now.

A catch-up subscription should not drop messages, even on reconnect.