RetriesLimitReachedException

We’re occasionally seeing EventStore.ClientAPI.Exceptions.RetriesLimitReachedException, for example:

EventStore.ClientAPI.Exceptions.RetriesLimitReachedException: Item Operation ReadStreamEventsForwardOperation ({00aad7dd-4d0e-4b23-94ea-dc8657ccb60b}): Stream: Redacted_ffbb23ce-c221-46af-a676-000000000000, FromEventNumber: 0, MaxCount: 500, ResolveLinkTos: False, RequireMaster: True, retry count: 10, created: 00:58:56.944, last updated: 01:00:08.178 reached retries limit : 10


The particular circumstance under which we’re seeing this is a maintenance operation that rebuilds read models, so we’re iterating through many streams and reading all events in each stream, distributed over multiple threads on multiple machines. So, a higher than average read load for our environment.

This is an Azure environment, Windows Server 2012 R2 machines, ES HA 3.0.1 and the ES 3.0.0 client (which we’re now updating to 3.0.1). The particular environment where we’re observing this is a test environment with only one machine running ES.

Aside from “get faster disk” and “add retries”, are there other things we can do, more along the lines of tuning or optimization?

For example, we’re reading streams in chunks of 500 events, and our events are probably large relative to some use cases (they frequently contain encrypted data, which tends to bloat things). The figure of 500 is arbitrary; we probably got it from sample code somewhere.

As a second example, we’ve occasionally read messages from users concerned about the amount of memory used by the file system cache on Windows, and we’ve read discussions where people are advised to lower the maximum allowable cache percentage. These machines are dedicated to ES, and we’re happy for whatever the combination of Windows + ES determines is “The Right Thing” to be what we do on these machines. Does it matter that Windows is mapping so many chunk files into memory? ES itself seems to consume a stable amount of memory, and on a (say) 14 GB machine we see most of the memory consumed by file cache, but intuitively this seems like a good thing. Perhaps it’s not, or perhaps it’s irrelevant.

Any thoughts appreciated. Note that from an Azure storage perspective, we are doing the optimal things in terms of multiple disks, storage account management, etc. (short of moving to the shiny new SSD backed storage or machines with SSD temp disks).

Brian

Hi Brian,
I think it is related to the chunk size of 500. I had the same problem: I also used a chunk size of 500, which made the transferred packages too large. That caused connection issues, which led to retries and, in your case, the RetriesLimitReachedException.

You should try setting this to 50 or 100 and test again. If this works, you could optimize by finding the sweet spot (200, for example) for your scenario.
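To illustrate the idea, here is a minimal sketch of a paged read loop with the page size as a parameter, so it is easy to try 50, 100, 200 and so on. It is written in Python and deliberately does not use the Event Store client: `read_page` is a hypothetical callable assumed to return the events of one page, the next event number, and an end-of-stream flag, and `make_in_memory_reader` is only a stand-in to exercise the loop.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical page shape: (events, next_event_number, is_end_of_stream).
Page = Tuple[List[bytes], int, bool]
ReadPage = Callable[[str, int, int], Page]


def read_all_events(read_page: ReadPage, stream: str, page_size: int = 100) -> Iterator[bytes]:
    """Yield every event in `stream`, requesting at most `page_size` events at a time."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    start = 0
    while True:
        events, next_event_number, is_end_of_stream = read_page(stream, start, page_size)
        yield from events
        if is_end_of_stream:
            return
        # Continue from where the previous page stopped.
        start = next_event_number


def make_in_memory_reader(streams: Dict[str, List[bytes]]) -> ReadPage:
    """Stand-in for a real client, backed by plain lists (illustration only)."""
    def read_page(stream: str, start: int, count: int) -> Page:
        data = streams[stream]
        chunk = data[start:start + count]
        next_event_number = start + len(chunk)
        return chunk, next_event_number, next_event_number >= len(data)
    return read_page
```

With a real client, `read_page` would wrap whatever forward-read call you already use; only the `page_size` argument needs to change between test runs.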

Regards

Nicolas

+1

The right page size also depends on the size of the events (with 200 KB
events it should obviously be much smaller).

I would try 100 for the page size, and I am willing to bet the issue goes
away. Also check that you haven't accidentally written some very large
(multi-MB) events.
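As a rough sketch of both points, the helpers below derive a page size from an average event size and flag unusually large events. The 1 MB per-page budget, the 500-event cap, and the function names are assumptions chosen for illustration, not Event Store recommendations; events are assumed to be available as raw byte payloads.

```python
from typing import Iterable, List, Tuple


def page_size_for(avg_event_bytes: int,
                  target_page_bytes: int = 1_000_000,  # assumed budget per page
                  max_page_size: int = 500) -> int:
    """Pick a page size so that one page stays near the byte budget."""
    if avg_event_bytes <= 0:
        raise ValueError("avg_event_bytes must be positive")
    # Always read at least one event, never more than the cap.
    return max(1, min(max_page_size, target_page_bytes // avg_event_bytes))


def find_oversized(events: Iterable[bytes],
                   limit_bytes: int = 1_000_000) -> List[Tuple[int, int]]:
    """Return (position, size) for every event larger than `limit_bytes`."""
    return [(i, len(e)) for i, e in enumerate(events) if len(e) > limit_bytes]
```

Under these assumptions, 200 KB events give a page size of 5, while 10 KB events give 100.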

Greg

Thanks Nicolas and Greg - sounds very reasonable. We’ll give that a shot.

Brian

That did eliminate the behavior, thanks again.

Brian