Catchup subscriptions - lost events

Hello,
We have a 3-node cluster (v21.10) with one read-only replica, hosted on Linux VMs on Azure.
We use the .NET gRPC client, NuGet: EventStore.Client.Grpc.Streams 21.2.0.
We have two catch-up subscriptions to the $all stream (on the read-only replica) from two different services.

When we run a load test with 40,000 (40k) events within 20 minutes, everything works correctly: every event is processed by both catch-up subscriptions. But when we tried a bigger load of 80,000 (80k) events within 40 minutes, one subscription received all events while the other did not (6 events were never received). With the bigger load it happens every time: some events are lost on one of the subscriptions.

Do you have any idea about this issue?

I have a production system with a catch-up subscription processing about 1 million events per hour, and I have never seen it miss any events. How do you ensure that you received all the events that you published?
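One way to answer that, as a rough sketch (the DeliveryCheck name and its members are illustrative, not from the thread): record the ID of every event appended during the load test and remove it when the subscription delivers it; whatever remains afterwards was never received.

// Sketch only: the writer side records the Uuid of every event it appends
// (EventData.EventId), and the subscription handler ticks them off.
// Requires: using System.Collections.Concurrent; using System.Linq; using EventStore.Client;
public static class DeliveryCheck
{
    // Writer side: one entry per appended event.
    public static readonly ConcurrentDictionary<Uuid, byte> Expected = new();

    // Called from the subscription's eventAppeared handler.
    public static void MarkReceived(ResolvedEvent evnt) =>
        Expected.TryRemove(evnt.Event.EventId, out _);

    // After the test: any IDs still present were never delivered.
    public static IReadOnlyCollection<Uuid> Missing() => Expected.Keys.ToList();
}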

Also, have you only tried a replica, or do you have the same behaviour when subscribing to any of the cluster nodes?

Hi Alexey, thanks for your response. Yes, we first tried the catch-up subscription on the cluster nodes, and after we observed the issue we tried the replica node. We log info about each event that appears on our subscription. When we see that some events are not processed into our read database, we check the logs and find that the event was never actually received by the subscription.

We use the latest EventStore .NET client, 22.0.0. This is how we subscribe:

_subscription = await _esClient.SubscribeToAllAsync(
    start: FromAll.After(position),
    eventAppeared: ProcessEventAsync,
    subscriptionDropped: SubscriptionDropped,
    filterOptions: new SubscriptionFilterOptions(EventTypeFilter.ExcludeSystemEvents()));

private async Task ProcessEventAsync(StreamSubscription subscription, ResolvedEvent evnt, CancellationToken token)
{
    _log.LogInformation("Starting processing EventType: {0} , EventNumber: {1} , StreamId: {2}",
        evnt.Event?.EventType, evnt.Event?.EventNumber, evnt.OriginalStreamId);

    ............
}
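Since system events are filtered out and links are not resolved here, each original stream's events should arrive in order with contiguous event numbers (assuming no stream truncation or deletion). A rough gap check along those lines, with an illustrative _lastSeen field, could look like this:

// Sketch: per-stream gap detection for a filtered $all subscription.
// Assumes streams are never truncated/deleted and links are not resolved.
// Requires: using System.Collections.Concurrent;
private readonly ConcurrentDictionary<string, ulong> _lastSeen = new();

private void CheckForGap(ResolvedEvent evnt)
{
    var streamId = evnt.OriginalStreamId;
    var number = evnt.OriginalEventNumber.ToUInt64();

    if (_lastSeen.TryGetValue(streamId, out var previous) && number != previous + 1)
        _log.LogWarning("Gap in {Stream}: last seen #{Prev}, now #{Curr}", streamId, previous, number);

    _lastSeen[streamId] = number;
}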

It is very tricky to figure out what is going wrong. The subscription was never dropped, but some events never reached our code.

I have been having the same problem since I updated to 22.0.0. I am logging each event that appears, and I can see that some events that should appear do not.

@nikolla85, @philip.oterdahl
What is the log sink you're using?
(i.e. where are those logs written to, and by what library?)

When I use the TCP client (EventStore.Client 21.2.2) for catch-up subscriptions I do not lose any events, but with the gRPC client I do. Logs are written to Azure Application Insights. I am using Serilog.

WebHost.CreateDefaultBuilder(args)
    .UseSerilog((hostingContext, builder) =>
    {
        builder.ReadFrom.Configuration(hostingContext.Configuration)
            .Enrich.FromLogContext()
            .WriteTo.Console()
            .WriteTo.ApplicationInsights(…);
    });
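For comparison, the TCP catch-up subscription mentioned above would look roughly like this with the legacy EventStore.ClientAPI package (a sketch; the connection string and handler bodies are placeholders, not the actual code):

// Sketch with the legacy TCP client (EventStore.ClientAPI).
// Connection string and handlers are placeholders.
var connection = EventStoreConnection.Create("ConnectTo=tcp://admin:changeit@localhost:1113");
await connection.ConnectAsync();

var subscription = connection.SubscribeToAllFrom(
    lastCheckpoint: null, // null = start from the beginning of $all
    settings: CatchUpSubscriptionSettings.Default,
    eventAppeared: (sub, evnt) =>
    {
        Console.WriteLine($"{evnt.Event?.EventType} #{evnt.Event?.EventNumber} on {evnt.OriginalStreamId}");
        return Task.CompletedTask;
    },
    subscriptionDropped: (sub, reason, ex) => Console.WriteLine($"Dropped: {reason} {ex}"));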

What are the volumes and throughput rates? If I create a sample test with 10000 events, would I experience this issue? Are you connecting to the leader or the follower node?

I am running locally, logging to the console.
When I run a query in the EventStoreDB UI I can see that there should be 99 events of a certain type, but when logging the events of that same type that appear in the catch-up subscription to $all, I only get 95. I still receive new events, so it is as if 5 events just get skipped.

This varies as well: somewhere between 90 and 96 of the 99 events appear.
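One way to cross-check that count outside the UI, as a rough sketch ("MyEventType" is a placeholder for the type being queried), is to read $all forwards with the same gRPC client and count the matching events:

// Sketch: count events of one type by reading $all directly.
// "client" is an EventStoreClient; "MyEventType" is a placeholder.
var count = 0;
await foreach (var evnt in client.ReadAllAsync(Direction.Forwards, Position.Start))
{
    if (evnt.Event.EventType == "MyEventType")
        count++;
}
Console.WriteLine($"Found {count} events of type MyEventType");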

I did not have this problem when I ran EventStoreDB 21.10.0 in Docker and used the 21.2.0 gRPC NuGet package. I might have missed something or done something wrong when I updated, but I do not know what.

I have fewer than 10000 events and I am using a single node in Docker.

There seems to be no pattern. One test with half a million events over 3 hours was fine. Sometimes I hit the issue after 20k events within 10 minutes, sometimes after 40k within 20 minutes, sometimes after 60k within 30 minutes. When we connect to ES we use DNS, so we are not sure which node accepts the request. Connection string example: esdb+discover://admin:changeit@DNS:2113
When we want to use the read-only replica we add ?nodePreference=readonlyreplica
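Putting those two pieces together, the client ends up being created roughly like this (a sketch; "DNS" stands in for the actual cluster DNS name):

// Sketch: gRPC client created from the connection string quoted above,
// with the read-only-replica preference appended.
var settings = EventStoreClientSettings.Create(
    "esdb+discover://admin:changeit@DNS:2113?nodePreference=readonlyreplica");
var client = new EventStoreClient(settings);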

With the TCP client we have not lost any events so far.

One possible issue is a difference in subscription behaviour that I have experienced myself: when EventAppeared fails in a gRPC catch-up subscription, there is no visible sign of it, but the subscription stops. That is definitely not the behaviour you observed, but I'd try shovelling the events somewhere else (not App Insights) in the same VPC (files, Elastic, etc.). It might be that the events did arrive but weren't successfully propagated to the sink.
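A rough sketch of both suggestions combined (the wrapper name and file path are placeholders): append every delivered event to a local file, so loss in the Application Insights pipeline can be ruled out, and catch handler exceptions so a failing EventAppeared cannot stop processing silently.

// Sketch: wrapper around the existing handler. Placeholder names/paths.
// Requires: using System.IO;
private async Task ProcessEventSafeAsync(StreamSubscription subscription, ResolvedEvent evnt, CancellationToken token)
{
    // Local, append-only record of every event the subscription delivered.
    await File.AppendAllTextAsync(
        "received-events.log",
        $"{evnt.OriginalStreamId}\t{evnt.OriginalEventNumber}\t{evnt.Event?.EventType}{Environment.NewLine}",
        token);

    try
    {
        await ProcessEventAsync(subscription, evnt, token);
    }
    catch (Exception ex)
    {
        // Without this, a throwing handler can stop the subscription with no visible sign.
        _log.LogError(ex, "Handler failed for {Stream} #{Number}", evnt.OriginalStreamId, evnt.OriginalEventNumber);
    }
}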

I will try to test a long-running subscription to see if I can reproduce it.

When we connect to ES we use DNS, so we are not sure which node accepts the request. Connection string example: esdb+discover://admin:changeit@DNS:2113

That’s gonna connect to the leader node

Any news regarding this? Is there maybe a relevant GitHub issue?
We’ve just encountered the same issue.
EventStoreDB 22.10.1.0 (Single node, running on Windows Server) with EventStoreDB-Client-Java (gRPC) 4.2.0.

@jezovuk could you share more about what type of events you were sending? Do you know, for example, how large their payloads might be?

Are we talking about a regular subscription or a subscription to $all?

We don’t have any update as we have never been able to reproduce the issue.

Events are 10-30 KB JSONs. Catch-up subscription (regular, single stream, not to $all). We've observed several occurrences under similar circumstances:

  • ~15 data streams being written to (serial append, one stream at a time)
  • 1 stream holds links to events from multiple data streams, including the above-mentioned data streams
  • several concurrent threads doing heavy reading from the aggregated link-stream
  • each of those threads initiates a catch-up subscription to exactly one data stream once it ‘finishes’ the aggregate stream read
  • the dropped event has always been the first event in the data stream (event #0), even though the catch-up subscription starts from the stream start (the first received event was event #1)

Upon another attempt (same data, same streams), everything worked as expected.
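The report above used the Java client, but a rough C# equivalent of the check is sketched below: subscribe to one data stream from the start and log the first OriginalEventNumber delivered ("data-stream-1" is a placeholder stream name; in the failing runs the first number seen was 1 instead of 0).

// Sketch: verify whether event #0 is actually delivered when subscribing from the start.
await client.SubscribeToStreamAsync(
    "data-stream-1",
    FromStream.Start,
    eventAppeared: (sub, evnt, ct) =>
    {
        Console.WriteLine($"Received #{evnt.OriginalEventNumber} from {evnt.OriginalStreamId}");
        return Task.CompletedTask;
    },
    subscriptionDropped: (sub, reason, ex) => Console.WriteLine($"Dropped: {reason}"));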