SubscriptionDropped never called for hosted ES instance

We have a couple of applications that subscribe to the $all stream with filters. They lose their connection every day or two, and the SubscriptionDropped callback is never invoked. The applications run on .NET 5.0 and use EventStore.Client.Grpc v21.2.0. Our EventStore instance in ES Cloud is 20.10, and we have a self-hosted instance of ES on the same version that sits within our vnet.

On our self-hosted instance our connections are never dropped. With the same code deployed to our Azure environment and connected to ES Cloud, this happens at least 2 or 3 times a week. The only way we’ve been able to resolve it has been to restart the application, which reconnects to ES and continues processing from the last recorded checkpoint.

This is what the test subscriber looks like: https://gist.github.com/teeroddesigns/4eb19238ead16c5ff4ed588d4b7014ac

I started this up on Thursday afternoon (Aug 12, 2021). When I came back this morning and created some new events, they were not picked up by the application. Furthermore, there were none of the Error-level logs that I would have expected from the SubscriptionDropped callback.

We have had the same code running against our self-hosted environment for well over a year and have never once run into this issue. We have two hosted ES clusters and the issue occurs in both. Is there something I’m missing here?

Your application is deployed to Azure. What about the cluster itself?

The clusters we are connecting to are EventStore Cloud on Azure in central US. We have a testing instance that is single-node F1 and a production instance that is configured as a Three Node Multi Zone. This issue occurs in both environments.

This can be related to networking issues or delays and timeouts between the self-hosted client and Event Store Cloud. The general recommendation (not only for ESC) is to have a retry/reconnection policy in the application code.

See the sample code for how to achieve that without restarting the server: https://github.com/EventStore/samples/blob/main/CQRS_Flow/.NET/Core/Core.EventStoreDB/Subscriptions/SubscribeToAllBackgroundWorker.cs#L156

We do have code to resubscribe in the SubscriptionDropped callback in our applications. In the Gist I sent over, I’ve simplified the code to just log an error if the callback is executed, but it never is. For the code in the example to work, wouldn’t that callback still need to be executed?

That’s unusual. Could you provide more details about your setup? Are you using load balancers or proxies? Could you provide logs?

Regarding the sample: indeed, the callback needs to be called to trigger the resubscribe.

Sure! We are running our applications on a private AKS cluster. We do use HAProxy as a load balancer for our WAFs, either of which could potentially be altering traffic from our apps. On our development (self-hosted) EventStore instance, where we don’t see this issue, the EventStore server runs on the same virtual network as the AKS cluster, so there are no proxies between the apps and ES.

Our network admin is out until Thursday so I’m not sure about the specifics for configuration or logs that we might have here.

We have verified that the network traffic from our Kubernetes pods does not pass through our WAF. DNS resolves our hosted ES cluster to an IP address in the peered network on Azure, and the traffic flows only over the peered network. Are there any settings or configuration considerations for running an EventStore client subscription from AKS to ES Cloud?

There is a blog post and docs with the details on configuring AKS to ES Cloud written by @alexey.zimarev. See:


https://developers.eventstore.com/cloud/use/kubernetes/aks.html

After you check them, I’d be grateful for feedback on whether they are enough or something is missing :)

AKS was actually quite straightforward with both CNI and kubenet. When networks are correctly peered, you can deploy a busybox ephemeral container and try to ping any of the ESDB cloud nodes.

Thanks for the link. There is a slight difference in our network setup: we have a subnet 10.252.0.0/17 peered with our EventStore network, and our VNET address space is 10.252.0.0/16. All of our AKS nodes sit within the peered subnet address space. We do have connectivity to our ES cluster from our peered subnet, and the subscriptions from AKS pods to hosted ES work initially. It’s after a day or two that the connection drops. I think this only occurs when there are long gaps between events, but it’s not easy to figure out how long that gap has to be. We don’t see this happen during periods of regular usage, and I would expect either that keepalive pings would keep the connection alive or that SubscriptionDropped would be called so we could resubscribe.

I set up an experiment that is interesting.

I created a Docker image that subscribes to the $all stream and logs a message on EventAppeared. I started a pod in our AKS cluster using this image, and I also started a container on a VM that is on the same subnet. I waited 2 days, then added a new event to a stream. The container running on the VM is still receiving events, but the pod running in the AKS cluster is not. I’m not sure what to do with that information yet, but I’m open to ideas if anything stands out. I’ve enabled verbose logging on both, so hopefully I’ll get a little more information when this happens next.
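
Since SubscriptionDropped is never invoked in this scenario, one possible client-side workaround is a liveness watchdog that forces a resubscribe when the subscription has been silent for too long. Below is a minimal sketch, not a confirmed fix. `SubscriptionWatchdog`, `Touch`, `IsStale` and `RunAsync` are hypothetical names and not part of EventStore.Client. It also assumes the subscription sees some activity at least once per window (for example a periodic heartbeat event that the application itself writes); without that, a legitimately quiet stream looks the same as a dead connection.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of EventStore.Client): remembers when the
// subscription last showed any sign of life and reports when it has been
// silent for longer than the allowed window.
public sealed class SubscriptionWatchdog
{
    private readonly TimeSpan _maxSilence;
    private long _lastActivityTicks;

    public SubscriptionWatchdog(TimeSpan maxSilence)
    {
        _maxSilence = maxSilence;
        Touch();
    }

    // Call this from the EventAppeared handler (and from any checkpoint
    // callback, if one is configured) to record activity.
    public void Touch() =>
        Interlocked.Exchange(ref _lastActivityTicks, DateTime.UtcNow.Ticks);

    // True when more than maxSilence has elapsed since the last Touch().
    public bool IsStale(DateTime utcNow) =>
        utcNow.Ticks - Interlocked.Read(ref _lastActivityTicks) > _maxSilence.Ticks;

    // Polls until cancelled. When the subscription looks stale, invokes the
    // caller-supplied resubscribe delegate and restarts the silence window.
    // Task.Delay throws OperationCanceledException once ct is cancelled.
    public async Task RunAsync(Func<Task> resubscribe, TimeSpan pollInterval, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            await Task.Delay(pollInterval, ct);
            if (!IsStale(DateTime.UtcNow)) continue;

            await resubscribe();
            Touch();
        }
    }
}
```

In the test subscriber this would mean calling `Touch()` at the top of the EventAppeared handler and passing `RunAsync` a delegate that disposes the old subscription and subscribes again from the last recorded checkpoint, which is the same thing the application restart does today.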