Correct way to disconnect from persistent subscription

What is the correct way to disconnect from a persistent subscription? And does the Event Store .NET API attempt to automatically connect to the persistent subscription under any circumstances?

After long periods of inactivity (greater than 24 hours), we see multiple open connections to the subscription group. I see some subscriptions with around 100 connections!

To disconnect from a persistent subscription, call Stop() on the subscription. Event Store should not automatically connect to a persistent subscription.
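For example, a minimal sketch (the stream, group, and handler names here are illustrative, and `Handle` is a hypothetical event handler):

```csharp
// Connect to a persistent subscription, then stop it cleanly on shutdown.
var subscription = await connection.ConnectToPersistentSubscriptionAsync(
    "my-stream",   // stream name (illustrative)
    "my-group",    // subscription group (illustrative)
    (sub, resolvedEvent) => Task.Run(() => Handle(resolvedEvent)));

// Stop() waits up to the given timeout for in-flight events to finish
// and releases the server-side connection for this subscriber.
subscription.Stop(TimeSpan.FromSeconds(10));
```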

How are you measuring the number of connections to the subscription group?

And how are your clients connecting to your persistent subscription?

Yep. That is what we call.

I am looking at the connections for the subscriber group on the competing consumers web page.

We are using ConnectToPersistentSubscriptionAsync to connect.

And I think this problem is just specific to our clients running in a virtual machine after long periods of inactivity.

Also, is it possible to just disconnect all connections to a particular subscriber group? We have 100 connections for a client, and stopping the client just reduces the connections by 1. The only way to get rid of the other 99 is to restart the Event Store.

Our eventstore is running in Azure.

This sounds like you are leaking connections. Taking down the client process would also close them.

What causes that? And taking down the client process just reduces the number of connections by 1. The only way to get rid of all connections is to shut down the event store.

The UI tells you where the connections are from. How are you measuring this?

I am looking at the connections column in the competing consumers page.

I did not check the source of the connections today, but when it occurred previously the connections were all from the same machine where the client was running (we had no other client running on that machine). The client did not receive any events until we restarted it. I also have a suspicion this is related to virtual machines, since the subscription groups connected from the base machine do not have this issue.

The issue occurred again and I grabbed some screenshots:

  1. CC1.png - shows the main connections page. There is only one connection.

  2. CC2.PNG - shows the competing consumers page. We connect to each subscriber group from a separate machine. As you can see, the number of connections is all over the place. They should all be 1. Some of them show 100, which corresponds to the maximum-subscribers setting.

  3. CC100.txt - shows the connection info for a particular subscriber group that shows 100 on the CC web page. I have replaced the actual IP with 0.0.0.0 (I can send it through email if needed).

The IP in CC100.txt is different from the IP in CC1.PNG. So some subscriber group connections are not being cleaned up when the connection drops.

Thoughts on why this is happening and how to avoid it? As I said before, I have a feeling this is limited to VMs.

CC100.txt (25.5 KB)

CC1.png

cc2.png

Would you be able to share a sample of the code you are using to connect to your persistent subscriptions, or provide us with a test case?

Hey, I was able to replicate the issue with the attached test service. Steps to reproduce: install the service and run it over the weekend. More than 24 hours of inactivity results in orphan connections showing up, and event delivery has issues. The only solution we have found so far is to restart the Event Store every Monday.

OrphanConnectionTest.zip (1.02 MB)

Can you include a log from this period? (It will obviously take us a while to run the test.)

I have attached the logs.

EventStore Logs.zip (5.6 MB)

Which date did the problem happen on, and what setup are you running? I am noticing some odd messages going through, such as:

[PID:09600:011 2017.07.14 04:33:48.184 ERROR QueuedHandlerMresWit] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 79642ms. Q: 0/2.

[PID:09600:011 2017.07.14 04:33:55.981 ERROR QueuedHandlerMresWit] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 7797ms. Q: -1/0.

[PID:09600:011 2017.07.14 04:34:49.904 ERROR QueuedHandlerMresWit] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 51251ms. Q: 0/1.

[PID:09600:011 2017.07.14 08:32:51.506 ERROR QueuedHandlerMresWit] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 10297ms. Q: 0/0.

[PID:09600:011 2017.07.14 08:39:02.469 ERROR QueuedHandlerMresWit] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 351228ms. Q: 0/11.

[PID:09600:011 2017.07.14 08:39:10.063 ERROR QueuedHandlerMresWit] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 7594ms. Q: -10/0.

[PID:09600:011 2017.07.14 08:39:31.657 ERROR QueuedHandlerMresWit] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 20063ms. Q: 0/0.

79 seconds to write data to disk is rather unusual.

The service has been running since the 14th. The problem should have occurred on the evening of the 15th or on the 16th. I checked on the 15th (around noon) and there were no orphan connections.

We are running a single instance of Event Store 4.0 (EventStore.ClusterNode.exe --db ./db --log ./logs --ExtIp 0.0.0.0). We are also using the open source version (we are in the process of acquiring a commercial license).

Have you had a chance to look into this issue?

Hi Anand,

I have not yet had a chance to test this properly, but have looked through the code sample.

Could you please try calling Stop() on the subscription when it is dropped and see if that helps? The connection code currently doesn't do this.

For example:

```csharp
var result = await _connection.ConnectToPersistentSubscriptionAsync(
    streamName,
    groupName,
    (subscription, resolved) => Task.Run(() => OnEventAppearedPersistentStream(
        subscription, GetSubscriptionKey(streamName, groupName), resolved, false)),
    (subscription, dropReason, exception) =>
    {
        // Stop the dropped subscription so its server-side connection is released.
        subscription.Stop(timeout); // timeout is a TimeSpan, e.g. TimeSpan.FromSeconds(10)
        Task.Run(() => OnSubscriptionDropped(streamName, groupName, dropReason, exception));
    },
    null,                        // userCredentials
    Environment.ProcessorCount,  // bufferSize
    false);                      // autoAck
```

Also, how often is your connection to the persistent subscription being dropped and reconnected?

And if there are any, what is causing the subscription connections to drop?

I will run the test with Stop and see what happens.

Based on our test program logs, the subscriptions dropped and reconnected around 15 times from the 13th to the 17th of July. The connection dropped a similar number of times. The connection drops may be related to the heartbeat timing out; we used to have frequent connection drops, and Greg recommended we increase the heartbeat timeout.
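For reference, heartbeats are configured on the client via the EventStore.ClientAPI connection settings builder; a minimal sketch with illustrative values (not our exact production settings):

```csharp
using System;
using EventStore.ClientAPI;

// Raise the heartbeat interval/timeout so brief stalls (e.g. on an idle VM)
// do not drop the TCP connection. Values are illustrative.
var settings = ConnectionSettings.Create()
    .SetHeartbeatInterval(TimeSpan.FromSeconds(5))
    .SetHeartbeatTimeout(TimeSpan.FromSeconds(15))
    .KeepReconnecting();

var connection = EventStoreConnection.Create(
    settings, new Uri("tcp://admin:changeit@localhost:1113"));
```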

Hi Anand,

After more testing today, we believe we have found the bug causing your issue.

When the client attempts to create a connection to a persistent subscription, it will send off a request and wait for a response. If it does not receive a response within the OperationTimeout, it will retry the connection operation.

Since you have specified the KeepRetrying setting, the client will retry this connection operation until it eventually succeeds.

In addition to this, the Persistent Subscription Service on the Event Store server does not track which client operation caused the connection operation. This means that the server will create a connection for each retry.

From the logs you provided, your Event Store server was experiencing a lot of slow queues during the time you saw the extra connections being created.

This means that a lot of the connection operations were likely timing out and being retried, causing multiple connections to be created at the same time.
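In the meantime, a possible client-side mitigation is to bound the retries rather than retrying forever, so a slow server cannot accumulate one extra connection per retried operation. A rough sketch, assuming the EventStore.ClientAPI settings builder (the timeout and retry counts are illustrative):

```csharp
using System;
using EventStore.ClientAPI;

// Bound the operation retries (instead of KeepRetrying) and give slow
// writes more headroom, so retried connect operations cannot pile up
// as duplicate connections on the server. Values are illustrative.
var settings = ConnectionSettings.Create()
    .SetOperationTimeoutTo(TimeSpan.FromSeconds(30))
    .LimitRetriesForOperationTo(5)
    .KeepReconnecting();

var connection = EventStoreConnection.Create(
    settings, new Uri("tcp://admin:changeit@localhost:1113"));
await connection.ConnectAsync();
```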

An issue for this has been added to GitHub; you can track it here.

Hi, thanks for the response. Do you have an ETA for the fix? We will be using Event Store in a production environment in a few weeks.