Why is ES filling the disk with stats?

We are using EventStore for a project in its early days. We have around 20,000 events of roughly 500 bytes each, which should amount to about 10 MB of data.

Since we started EventStore in February, the chunk files have grown to a total of 25 GB (a factor of 2,500), which we discovered by chance!! What on earth is the point of filling the disk with all that useless information? Is it just to drive people into needing consultancy assistance when the system crashes? If you search, you discover you can decrease the stats collection frequency and retention, but you still have to manually initiate a scavenge to clean up. Why? It seems such a waste of bandwidth, hardware and energy - and ultimately CO2 emissions.

The only information in the docs is:

"Stats and debug information

Event Store has a lot of debug and statistics information available about a cluster you can find with the following request:"

Not a word saying it will fill your servers with useless information if you don't proactively clean up all the time (another example of the lack of good documentation on EventStore).

/hoegge

The statistics are saved such that you have some way to actually look at them and subscribe to them.

It is controlled by a flag that has been there since before version 1.

but you still have to manually initiate scavenge and clean up. Why?

Because scavenging removes data, and we don’t remove data without that being requested.

when the system crashes

Why would extra data cause a crash?

useless information

It’s not useless in the slightest - it’s in fact a fantastic way to instantly get a handle on how a cluster is performing under an actual workload without having to perform synthetic tests.

What's the point of having it on by default? So that when you say "OMG, Event Store sucks, yesterday we saw ..." we can say "hey, could you send us the information in stream {...} from yesterday?"

These streams are marked with $maxCount I believe (it might be $maxAge) and will automatically be scavenged if you are running scavenges. There are also command line options to control this behaviour: --stats-period-sec, which is documented here: https://eventstore.org/docs/server/command-line-arguments/index.html, controls how often a stats snapshot is taken, so putting a large value here should alleviate any issues you have.
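If you want to check what is actually set on those streams, something like this works (a minimal sketch in Python with requests, assuming a default single node on 127.0.0.1:2113, the stock admin/changeit credentials, and the $stats-{ip}:{port} stream naming; treat the exact endpoints and names as assumptions and check the HTTP API docs for your version):

```python
import requests
from urllib.parse import quote

# Assumptions: default single node on 127.0.0.1:2113, stock admin/changeit
# credentials, and a stats stream named $stats-{ip}:{port}.
BASE = "http://127.0.0.1:2113"
AUTH = ("admin", "changeit")
STATS_STREAM = quote("$stats-127.0.0.1:2113", safe="")  # URL-encode $ and :

# Read the stream metadata to see the $maxAge / $maxCount truncation settings
# that a scavenge will honour for the stats stream.
meta = requests.get(
    f"{BASE}/streams/{STATS_STREAM}/metadata",
    auth=AUTH,
    headers={"Accept": "application/json"},
)
meta.raise_for_status()
print("stats stream metadata:", meta.json())
```

Combined with a larger --stats-period-sec at startup, that keeps the stats footprint small once scavenges are running.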

Re scavenge being manual: you can make it automatic quite easily, it's just an HTTP POST (there is also a wrapper in the client API for the HTTP POST). We figured people would rather use their own schedulers etc. than have to use a custom ES one that nobody has ever seen before ... thus favoring "it works with whatever you already use...". In general, sending a parameterless HTTP POST as a scheduled task is a fairly common operation, no?
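For example, something like this is all a scheduled task needs (a minimal sketch in Python with requests, assuming a default single node on 127.0.0.1:2113 and the stock admin/changeit credentials - adjust both for your cluster):

```python
import requests

# Assumption: default single node on 127.0.0.1:2113 with the stock
# admin/changeit credentials; point this at each node you want to scavenge.
resp = requests.post(
    "http://127.0.0.1:2113/admin/scavenge",
    auth=("admin", "changeit"),
)
resp.raise_for_status()

# The response simply acknowledges that a scavenge has been started;
# progress can be followed in the admin UI.
print(resp.status_code, resp.text)
```

Drop that into cron, the Windows Task Scheduler, or whatever you already run, and scavenging is effectively automatic.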

Thanks for the quick response - will try / do that. I think, though, that these things should be stressed in the documentation. James: which flag, present since version 1, controls what? Do you mean the stats period in seconds?

It seems a bit counter-intuitive, though, that the files need to be cleaned up and scavenged to remove stats, instead of leaving the event log files untouched / immutable (like a WORM drive). Doesn't that also impact performance and stability, and make it harder to work reliably on networked drives (which might be why you discourage using them for data)?

best

Hoegge

The amount of data actually stored is not particularly significant - 25 gigabytes in what I assume (since you mentioned “early”) to be 5-6 years minimum is pretty reasonable… Most users run garbage collection for lots of reasons beyond this, and it has no impact on stability or performance (1 event every 30 seconds by default, controlled by --stats-period-sec).

If you’re running on network drives or a WORM, you’ll need to tune way beyond the statistics.

The reason we discourage using network attached drives is because of spiky latency, nothing related to statistics (though the statistics at least allow us to diagnose that when it occurs).

The documentation is open source and you can contribute if there are specific things you feel are unclear: https://github.com/EventStore/documentation

The 25 gigabytes are since February this year. Of course we can manage 50 GB per year of stats data, but it just seems wasteful to save it by default in the same files as our domain events. But now we know and can minimize it.

thanks

I am also wrestling with this issue a bit, but I am making progress.
I have a 250 GB database after six months, even with the stats adjusted. Currently we do not delete anything (except stats) and we use ES for storing aggregates in an ES-CQRS fashion. We adjusted the stats and use a 2 TB SSD.

Scavenging takes up to 20 hours because it always starts from scratch, but there is an option to continue from a certain point; that needs a custom service to be implemented. It would be great if some kind of GC were integrated into ES.
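As a rough sketch, the custom service boils down to something like this (Python with requests against a default node; the startFromChunk query parameter is an assumption based on newer server versions - older releases only accept a plain POST, so check the docs for your version):

```python
import requests

EVENTSTORE = "http://127.0.0.1:2113"  # assumed node address
AUTH = ("admin", "changeit")          # stock credentials; change in production


def start_scavenge(start_from_chunk: int = 0) -> str:
    """Kick off a scavenge, optionally resuming from a given chunk number.

    startFromChunk is an assumption based on newer server versions; a plain
    POST with no parameters starts a full scavenge from the beginning.
    """
    resp = requests.post(
        f"{EVENTSTORE}/admin/scavenge",
        params={"startFromChunk": start_from_chunk},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    # A wrapper service would persist the last chunk a previous run completed
    # and pass it here, so a 20-hour scavenge does not have to start over.
    print(start_scavenge(0))
```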

We just need to investigate our use cases, the details of the ES implementation, and best practices. The docs are lacking, and I intend to contribute on this ASAP with settings recommended for production.

That being said, ES works great in our case. It is rock-stable; we have never had a single issue with data corruption or performance degradation due to data size. Starting up takes some time (even with checks disabled), but that is done rarely. We run 27 catch-up projections with subscribers for data denormalization to RavenDB, and have created some tools for .NET that automate stuff. Will share these as open source too.

ES is integrated in our DDD stack and it works great. Our only issue is disk space consumption, but that is due to our lack of ES understanding.

ES is a beautiful piece of tech with great potential.

Best regards

What do you mean by custom GC? Do you mean scavenging being scheduled internally?

If so, this has been discussed quite a few times. What features and capabilities would you want? Would you want internal workload knowledge involved?

There has been loads of discussion around various ways of doing this. The problem is of course … getting it right, which is nontrivial :slight_smile: