Possible memory leak when writing many new streams / links

Heya! So I’ve got a few hundred small streams and I want to build a set of new streams out of them. I made a persistent subscription that reads $all and, when it sees a relevant event, appends a link to the relevant output stream. This works fantastically! I’ve now got a few hundred new streams with my data organized nicely. The problem is that memory usage has ballooned from 1.5GiB to 20GiB. I ran a scavenge and it did nothing (there wasn’t anything to delete, but I figured it wouldn’t hurt to try). Rebooting ESDB resets it down to ~1.6GiB, and it stays there while idling.
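For reference, the link-appending part looks roughly like this (a simplified sketch using the Go client; the output stream name and how I pick events are placeholders):

```go
// Simplified sketch of the link-appending side; stream names and the
// relevance check are made up. A link is a regular event whose type is
// "$>" and whose body is "<eventNumber>@<sourceStream>".
package links

import (
	"context"
	"fmt"

	"github.com/EventStore/EventStore-Client-Go/v3/esdb"
)

// appendLink writes a link pointing at ev into outputStream.
func appendLink(ctx context.Context, db *esdb.Client, outputStream string, ev *esdb.RecordedEvent) error {
	link := esdb.EventData{
		EventType:   "$>",
		ContentType: esdb.ContentTypeBinary,
		Data:        []byte(fmt.Sprintf("%d@%s", ev.EventNumber, ev.StreamID)),
	}
	_, err := db.AppendToStream(ctx, outputStream, esdb.AppendToStreamOptions{}, link)
	return err
}
```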

I’m running a single instance using the eventstore/eventstore:21.10.8-buster-slim image in GKE and the only non-default config I’ve got is EVENTSTORE_CACHED_CHUNKS=3

How do I keep its memory use under control?

It’s probably the cache and the stream existence filter that grow when you create a lot of streams. When running ESDB in Kubernetes, you need to make sure the pod resource limits are set accordingly. You can limit CPU to 2 and memory to 4GB. For stateful workloads it’s recommended to set the request and the limit to the same values. If it works for you, set CPU to 1 and memory to 2GB; ESDB will adjust its cache settings accordingly. I don’t remember exactly when that auto-tuning was added, though, so you might need to use the latest version.
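In the pod spec that looks something like this (just the container’s resources section, using the values suggested above):

```yaml
# Requests and limits set to the same values so the scheduler and the
# runtime agree on what the database container can use.
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "1"
    memory: 2Gi
```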

I tried those settings and definitely saw the limits take hold. Before, CPU was idling at around 0.02 and memory at around 1.6G; now it’s idling near the requested limits (2 CPU and 4GB), but it’s somehow running worse while consuming more resources? Before, I could read a stream with 800 events basically instantly; now it’s taking ~40s. Same result with 21.10.8 and 22.10.0.

OK, so I figured out what was going on there. I guess the larger streams that were reading quickly had been cached by ESDB, and changing the resources restarted the DB and cleared the cache. Is there a way to flag a stream as, like, “priority for cache”? The streams built for reading are well known in this case. It’s not the biggest problem if not, since the 800-event stream is in my test data and the prod ones aren’t so crazy, but it’d be nice to be able to tune it for faster reads.

Instant reads probably indicate that you were reading from cached chunks. The chunk size is 256 MB by default, so if you want three chunks to be cached, you need the memory limit to leave room for them on top of everything else. What you called a “leak” is not a leak, it’s a populated cache. The cache doesn’t expire over time, so you will not see memory usage decrease.

Reading 800 events should not take 40 seconds. Without knowing the details about your environment, it’s hard to say what the reason could be.

There’s no way to “flag” streams, as ESDB doesn’t cache individual streams; it caches chunk files.

It’s possible to cache those streams on the application side. If you don’t have session stickiness, you’d probably want to ensure that the cache contains the latest version of the stream when you get a new operation for it. I have an issue open for a proposed feature: https://github.com/Eventuous/eventuous/issues/157
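As a rough illustration (not the feature proposed in that issue, just a minimal sketch using the Go client), an application-side read-through cache could look like this; on a warm stream it only reads whatever was appended after the last cached revision:

```go
// Minimal sketch of an application-side read-through cache keyed by stream
// name. On each access it only asks the server for events appended after
// the last cached revision, so warm streams cost one short read.
package streamcache

import (
	"context"
	"errors"
	"io"
	"sync"

	"github.com/EventStore/EventStore-Client-Go/v3/esdb"
)

type entry struct {
	events       []*esdb.ResolvedEvent
	nextRevision uint64 // revision to start the next read from
}

type StreamCache struct {
	db *esdb.Client
	mu sync.Mutex
	m  map[string]*entry
}

func New(db *esdb.Client) *StreamCache {
	return &StreamCache{db: db, m: map[string]*entry{}}
}

// Read returns all events of the stream, topping the cache up with anything
// appended since the previous call.
func (c *StreamCache) Read(ctx context.Context, stream string) ([]*esdb.ResolvedEvent, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e, ok := c.m[stream]
	if !ok {
		e = &entry{}
		c.m[stream] = e
	}

	opts := esdb.ReadStreamOptions{
		From:           esdb.Revision(e.nextRevision),
		ResolveLinkTos: true,
	}
	res, err := c.db.ReadStream(ctx, stream, opts, ^uint64(0)) // read to the end
	if err != nil {
		return nil, err
	}
	defer res.Close()

	for {
		ev, err := res.Recv()
		if errors.Is(err, io.EOF) {
			break
		}
		if err != nil {
			return nil, err
		}
		e.events = append(e.events, ev)
		e.nextRevision = ev.OriginalEvent().EventNumber + 1
	}
	return e.events, nil
}
```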

Hey again! After messing with it for a while, and getting distracted by other stuff for a bit, uncached reads are still super slow. What details would you need beyond what I’ve already provided? Currently, I’ve taken your suggestion of allocating 2 CPUs and 4GiB.

EDIT: Some things off the top of my head that might be relevant:
- The application is written in Go, using v3 of the client library.
- The stream it’s requesting from is entirely linkTos (possibly part of the issue?).
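The read path is basically this (trimmed down; the connection string and stream name are placeholders):

```go
// Trimmed-down read path; connection string and stream name are placeholders.
// The stream contains only link events, so ResolveLinkTos makes the server
// fetch the original events the links point at.
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"log"

	"github.com/EventStore/EventStore-Client-Go/v3/esdb"
)

func main() {
	settings, err := esdb.ParseConnectionString("esdb://localhost:2113?tls=false")
	if err != nil {
		log.Fatal(err)
	}
	db, err := esdb.NewClient(settings)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	opts := esdb.ReadStreamOptions{From: esdb.Start{}, ResolveLinkTos: true}
	stream, err := db.ReadStream(context.Background(), "my-output-stream", opts, 1000)
	if err != nil {
		log.Fatal(err)
	}
	defer stream.Close()

	for {
		ev, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if ev.Event == nil {
			continue // the linked event was deleted or scavenged
		}
		fmt.Println(ev.Event.EventType, string(ev.Event.Data))
	}
}
```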

Figured it out! Somehow my cluster had drifted from my Terraform config, and the pd-balanced disk (SSD-backed) had been replaced with pd-standard (an HDD). Working great now that it’s using the correct disk :slight_smile:


Glad you found it. Indeed, reading a stream that consists of links leads to massive seeks, as the server needs to resolve those links from a bunch of chunks. So it produces substantial IO on the index, as well as random reads across the chunk files.