Some Persistent Subscription subscribers sitting idle when messages are available

Hello!

We are currently facing a problem consuming events from our persistent subscriptions. At any point in time, many of our subscribers are sitting idle with all of their slots available while there are still many events left to handle.

We have a large number of events in many streams being funneled into larger category streams via the $by_category projection, against which our persistent subscription is registered; our clients then subscribe to the persistent subscription to consume these events. Because of the large number of events, we scaled the consuming service horizontally to get through them more quickly. We currently have 10 instances of our service consuming.
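
For context, here is a minimal sketch of our write side (the connection string and event type are placeholders): events are appended to per-entity streams such as Foo-{guid}, and the $by_category projection links them into the $ce-Foo category stream that the persistent subscription reads.

using System;
using System.Text;
using System.Threading.Tasks;
using EventStore.ClientAPI;

class Producer
{
    static async Task Main()
    {
        // Placeholder connection string -- substitute the real node address.
        var conn = EventStoreConnection.Create("ConnectTo=tcp://admin:changeit@localhost:1113");
        await conn.ConnectAsync();

        // Appending to a "Foo-{guid}" stream makes the event appear, as a
        // linkTo, in the "$ce-Foo" category stream via $by_category.
        var stream = $"Foo-{Guid.NewGuid()}";
        var body = Encoding.UTF8.GetBytes(@"{""some"":""payload""}");
        await conn.AppendToStreamAsync(stream, ExpectedVersion.Any,
            new EventData(Guid.NewGuid(), "SomethingHappened", true, body, null));
    }
}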

Here is the configuration for the persistent subscription:

"config": {
  "resolveLinktos": true,
  "startFrom": 0,
  "messageTimeoutMilliseconds": 25000,
  "extraStatistics": false,
  "maxRetryCount": 10,
  "liveBufferSize": 500,
  "bufferSize": 500,
  "readBatchSize": 20,
  "preferRoundRobin": false,
  "checkPointAfterMilliseconds": 2000,
  "minCheckPointCount": 10,
  "maxCheckPointCount": 1000,
  "maxSubscriberCount": 100,
  "namedConsumerStrategy": "Pinned"
}

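A subscription group with this configuration can be created through the HTTP API; a minimal sketch (the node address, stream, group name, and credentials are placeholders):

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class CreateGroup
{
    static async Task Main()
    {
        // All placeholders: node address, stream, group name, credentials.
        const string node   = "http://localhost:2113";
        const string stream = "$ce-Foo";
        const string group  = "foo-consumers";

        const string config = @"{
            ""resolveLinktos"": true,
            ""startFrom"": 0,
            ""messageTimeoutMilliseconds"": 25000,
            ""extraStatistics"": false,
            ""maxRetryCount"": 10,
            ""liveBufferSize"": 500,
            ""bufferSize"": 500,
            ""readBatchSize"": 20,
            ""preferRoundRobin"": false,
            ""checkPointAfterMilliseconds"": 2000,
            ""minCheckPointCount"": 10,
            ""maxCheckPointCount"": 1000,
            ""maxSubscriberCount"": 100,
            ""namedConsumerStrategy"": ""Pinned""
        }";

        using var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
            "Basic", Convert.ToBase64String(Encoding.ASCII.GetBytes("admin:changeit")));

        // PUT /subscriptions/{stream}/{group} creates the subscription group.
        var response = await client.PutAsync(
            $"{node}/subscriptions/{Uri.EscapeDataString(stream)}/{group}",
            new StringContent(config, Encoding.UTF8, "application/json"));
        Console.WriteLine($"{(int)response.StatusCode} {response.ReasonPhrase}");
    }
}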

A sample of what the persistent subscription knows about our connected clients:

{
  "from": "instance-1-ip",
  "username": "some-username",
  "averageItemsPerSecond": 0,
  "totalItemsProcessed": 39809,
  "countSinceLastMeasurement": 0,
  "extraStatistics": [],
  "availableSlots": 50,
  "inFlightMessages": 0
},
{
  "from": "instance-2-ip",
  "username": "some-username",
  "averageItemsPerSecond": 17,
  "totalItemsProcessed": 42887,
  "countSinceLastMeasurement": 14,
  "extraStatistics": [],
  "availableSlots": 0,
  "inFlightMessages": 50
}

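These figures come from the subscription info endpoint; a minimal sketch of fetching them (same placeholders as above):

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class GroupInfo
{
    static async Task Main()
    {
        // Same placeholder node, stream, group, and credentials as above.
        const string node   = "http://localhost:2113";
        const string stream = "$ce-Foo";
        const string group  = "foo-consumers";

        using var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
            "Basic", Convert.ToBase64String(Encoding.ASCII.GetBytes("admin:changeit")));

        // GET /subscriptions/{stream}/{group}/info returns the group state,
        // including a "connections" array with the per-client fields above.
        Console.WriteLine(await client.GetStringAsync(
            $"{node}/subscriptions/{Uri.EscapeDataString(stream)}/{group}/info"));
    }
}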

As you can see, one client is at 50 in-flight messages and the other is at 0, while there are still thousands (even hundreds of thousands) of events to process.

We know there are events being NAK'd and retried by the services, but I would still expect the other services to be able to consume the remaining events.

Any help tracking down what the problem is would be helpful.

Thank you!

Napoleone

So what pinned does is this:

First, it works out which bucket the event should go to:

private uint GetAssignmentId(ResolvedEvent ev) {
    // Hash the source stream id and map it onto one of the fixed buckets.
    string sourceStreamId = GetSourceStreamId(ev);
    return _hash.Hash(sourceStreamId) % (uint)_state.Assignments.Length;
}

Note that for linkTos it will use the actual source stream (not the
stream of the linkTo)
https://github.com/EventStore/EventStore/blob/master/src/EventStore.Core/Services/PersistentSubscription/ConsumerStrategy/PinnedPersistentSubscriptionConsumerStrategy.cs#L83

This picks which bucket to put the event in. Note that the stream id
is what is used for choosing the bucket! This does its best to ensure
that all events in stream "foozbazbar" end up at the same consumer
(note this also *normally* provides proper ordering for this stream at
the consumer!!! retries etc. can cause out-of-order delivery, but the
initial selection is *in order*...).

If the events are from the same stream (or only a few streams), they
could easily all end up in the same bucket. The same can happen if a
large number of events are written to a few streams which hash to the
same bucket. This moving to a designated client is the described
behaviour of the pinned strategy. It is used when ordering is needed
and appears to be behaving by design. Likely a round-robin strategy
with fewer assurances might be more what you are looking for?
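
To make the skew concrete, here is a small self-contained simulation
of that bucketing. FNV-1a stands in for the server's internal
stream-id hasher (an assumption for illustration), so the exact
distribution will differ, but the mechanics are the same:

using System;

class PinnedBucketSimulation
{
    // FNV-1a as a stand-in for the server's stream-id hasher (assumption).
    static uint Hash(string s)
    {
        uint h = 2166136261;
        foreach (char c in s) { h ^= c; h *= 16777619; }
        return h;
    }

    static void Main()
    {
        const uint buckets = 10;
        var counts = new int[buckets];

        // Thousands of distinct Foo-{guid} source streams: events spread out.
        for (int i = 0; i < 5000; i++)
            counts[Hash($"Foo-{Guid.NewGuid()}") % buckets]++;
        Console.WriteLine("many streams: " + string.Join(", ", counts));

        // One hot stream: every event hashes to the same bucket, so one
        // consumer gets all the work while the others sit idle.
        Array.Clear(counts, 0, counts.Length);
        for (int i = 0; i < 5000; i++)
            counts[Hash("Foo-one-hot-stream") % buckets]++;
        Console.WriteLine("one stream:   " + string.Join(", ", counts));
    }
}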

Greg

Just to correct:

"This moving to a designated client is the described behaviour of the
pinned strategy. It is used when ordering is needed and appears to be
behaving by design."

should read:

This moving to a designated client is the described behaviour of the
pinned strategy. It is used when *locality* is desired and appears to
be behaving by design.

If this is not clear, it is easy to imagine a consumer that does
something requiring a lock on, say, the "object" associated with the
stream. This strategy will attempt to deliver all messages for the
same "object" to the same consumer, which minimizes lock contention
and, under any normal circumstances, also provides ordering for that
consumer, though the ordering is best effort (there are things which
can break the assurance).
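
As a sketch of what that locality buys (the type and handler here are
hypothetical, purely for illustration): a consumer can keep per-stream
state in a plain local map with no cross-instance coordination,
because all events for a given source stream arrive at the same
client:

using System.Collections.Generic;

// Hypothetical consumer-side state. Because Pinned routes all events for
// a source stream to the same client, this map is only ever touched by
// this instance and needs no cross-instance locking.
class FooState
{
    private readonly Dictionary<string, int> _versionByStream = new Dictionary<string, int>();

    public void Apply(string sourceStreamId, int eventNumber)
    {
        if (_versionByStream.TryGetValue(sourceStreamId, out var current) && eventNumber <= current)
            return; // already applied
        // ... mutate the in-memory "object" for this stream, lock-free ...
        _versionByStream[sourceStreamId] = eventNumber;
    }
}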

Hi Greg. Thanks for the quick reply!

It seems from your explanation that the Pinned strategy is doing what we are expecting it to do. We switched from Round Robin to Pinned because we were attempting to reduce the number of concurrency issues when running events from the same stream.

We have a check in our consumers that saves the last successfully handled event number and NAKs with a retry if the received event is not the next one in sequence, to cover the case where an event is picked up by the "wrong" consumer (roughly as in the sketch below).
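
A minimal sketch of that guard, assuming the classic EventStore.ClientAPI TCP client connected with autoAck disabled (the in-memory checkpoint here is a simplification; ours is stored durably):

using System.Collections.Concurrent;
using EventStore.ClientAPI;

class OrderedConsumer
{
    // Last successfully handled event number per source stream.
    private readonly ConcurrentDictionary<string, long> _lastHandled =
        new ConcurrentDictionary<string, long>();

    // Wire this up via ConnectToPersistentSubscription(..., autoAck: false).
    public void EventAppeared(EventStorePersistentSubscriptionBase sub, ResolvedEvent ev)
    {
        var sourceStream = ev.Event.EventStreamId; // original stream of a resolved linkTo
        var eventNumber  = ev.Event.EventNumber;

        var last = _lastHandled.GetOrAdd(sourceStream, -1L);
        if (eventNumber != last + 1)
        {
            // Not the next event for this stream: NAK with retry so it is redelivered.
            sub.Fail(ev, PersistentSubscriptionNakEventAction.Retry, "out of order");
            return;
        }

        Handle(ev); // our actual event handling
        _lastHandled[sourceStream] = eventNumber;
        sub.Acknowledge(ev);
    }

    private void Handle(ResolvedEvent ev) { /* ... */ }
}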

I'm still missing why some of our consumers are sitting idle, though. Our data consists of several thousand streams of the form "Foo-SOME_GUID" (e.g. Foo-f7b43087-acc9-4e5e-bd69-fc9c2e75784a), each of which has 200-500 events whose handling order is important.

Would the whole consumption of events from the $ce-Foo stream (by one persistent subscription) be stalled if there is a backup (because of errors and retries) of events from any of the streams?

Thanks!

In the case of a subscription to a stream of linkTos it should be
using the origin stream of the event to decide which consumer gets it
(I verified this, since it was years ago that I wrote it :))

From the description of your situation it sounds like you have a whole
bunch of events coming back to a single stream (or multiple streams
that end up in the same bucket) along the way, which is why they are
not getting fanned out. An example of where this could happen is if
you did an import of 1m events into stream Foo. Reading them all at
that point, under this strategy they would all end up going to a
single consumer. The reason for this is that the *stream name* is what
is used in the distribution.

This problem obviously would not exist with the other strategies...

By “origin stream” of an event in this case do you mean the Foo-GUID stream, or the $ce-Foo stream?

If it’s the latter then I definitely see why this is happening! If it’s the former I thought a GUID as part of the name would potentially provide sufficient distinction to distribute the messages, but that was an assumption.

The Foo-{guid} (i.e. the original event stream, not the stream of the linkTo).

https://github.com/EventStore/EventStore/blob/master/src/EventStore.Core/Services/PersistentSubscription/ConsumerStrategy/PinnedPersistentSubscriptionConsumerStrategy.cs#L88

is where it is determined: it gets the original stream name.
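
On the consuming side the same origin stream can be read off a
resolved event; a sketch assuming the classic EventStore.ClientAPI
types:

using EventStore.ClientAPI;

static class ResolvedEventExtensions
{
    // For a resolved linkTo from $ce-Foo, ev.Event is the original record
    // (in stream "Foo-{guid}") and ev.Link is the pointer written by
    // $by_category; Event.EventStreamId is therefore the origin stream
    // that the Pinned strategy hashes.
    public static string OriginStream(this ResolvedEvent ev) =>
        ev.Event != null ? ev.Event.EventStreamId : ev.OriginalStreamId;
}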