Compressing / making events unavailable

Just curious at this point -

I have a use case where I’ll have … a lot of events. So many that ES consumes ~200 GB every 12 hours.

Are there any strategies / features I could use to “archive” old events I know will rarely be read back? Kind of like how graphite compresses old metric data to conserve disk space.

Right now I’m toying with the idea of using separate storage for a certain type of event, which I know would decrease disk usage significantly - but it makes things much more complicated (distributed transactions).

The archived events won’t be read back (except maybe in rare cases) because a snapshot of the stream exists. I still need the events to rebuild a snapshot (rarely), but I don’t need quick access to them.

This thread in the group might provide some useful pointers: https://groups.google.com/forum/#!topic/event-store/moAS2MOeyxc

At any rate, there isn’t anything built in to Event Store to accomplish what you’re after. The only features of Event Store that break the logical append-only model are deleting streams and setting a $maxAge property to cause old events to become eligible for scavenging. So if you want to “move” old events elsewhere, you’ll have to copy them yourself and then allow them to be scavenged.

You might find a way to partition your data into time-boxed streams; then you could copy whole streams to lower-cost storage and delete them from your primary.
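To make the time-boxed idea concrete, here is a rough sketch (not a tested tool) of that copy-then-delete step. It assumes a node on localhost:2113, the default admin credentials, the classic Atom-style HTTP read API with embed=body, and a placeholder time-boxed stream name; it pages the closed stream forward, dumps the events to a local file as a stand-in for lower-cost storage, and then soft-deletes the stream:

```python
import json

import requests  # third-party HTTP client

ES_URL = "http://127.0.0.1:2113"   # assumed single-node address
STREAM = "audit-2015-06"           # placeholder time-boxed stream name
AUTH = ("admin", "changeit")       # default credentials

def archive_stream(stream, out_path, page_size=100):
    """Copy every event of a stream to a local JSON-lines file."""
    start = 0
    with open(out_path, "w") as out:
        while True:
            resp = requests.get(
                f"{ES_URL}/streams/{stream}/{start}/forward/{page_size}",
                params={"embed": "body"},  # embed event data in the feed
                headers={"Accept": "application/json"},
                auth=AUTH,
            )
            resp.raise_for_status()
            entries = resp.json().get("entries", [])
            if not entries:
                break
            # Pages list newest entries first; write oldest-first instead.
            for entry in reversed(entries):
                out.write(json.dumps(entry) + "\n")
            start += len(entries)
            if len(entries) < page_size:
                break

archive_stream(STREAM, f"{STREAM}.jsonl")

# Once the copy is safely in cold storage, soft-delete the stream so a
# scavenge can reclaim the space (an ES-HardDelete: true header would
# make the delete permanent instead).
requests.delete(f"{ES_URL}/streams/{STREAM}", auth=AUTH).raise_for_status()
```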

There are some other things you can do: $maxCount, and there is $tb,
which is truncate before (i.e. delete everything before a given point).
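For reference, a minimal sketch of writing one of those retention settings as stream metadata over the HTTP API, assuming a node on localhost:2113, the default admin credentials, and the /streams/{stream}/metadata endpoint (the stream name and numbers are placeholders):

```python
import json
import uuid

import requests  # third-party HTTP client

ES_URL = "http://127.0.0.1:2113"   # assumed single-node address
STREAM = "inventory-item-42"       # placeholder stream name

# Keep at most 100,000 events; anything older becomes eligible for scavenging.
body = [{
    "eventId": str(uuid.uuid4()),
    "eventType": "$metadata",
    "data": {"$maxCount": 100000},
}]

resp = requests.post(
    f"{ES_URL}/streams/{STREAM}/metadata",
    data=json.dumps(body),
    headers={"Content-Type": "application/vnd.eventstore.events+json"},
    auth=("admin", "changeit"),    # default credentials; metadata writes need them
)
resp.raise_for_status()
```

Swapping "$maxCount" for "$maxAge" (in seconds) or "$tb" (an event number) works the same way.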

Thanks guys -
I think $maxCount would be best for this situation; the specific type of stream I am looking to archive doesn’t contain critical events… but it will still be hard to just delete them. For now I’ll start using that; maybe if this comes up again I’ll look at doing some sort of PR for archiving.

Normally hot/cold storage is implemented in different places on different mediums. If you want to ensure it’s done perfectly, $tb is a better option (a subscriber that moves the data sets it at intervals). If you only use $maxCount, it’s possible a writer could quickly push a bunch of stuff before the subscriber has archived it.
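The shape of that approach might look something like the sketch below, with the same assumed endpoint and credentials as above and a hypothetical archive_new_events helper standing in for whatever actually copies events to cold storage. The point is that $tb only moves forward after the copy is known to be durable, so nothing can be scavenged before it has been archived:

```python
import json
import time
import uuid

import requests  # third-party HTTP client

ES_URL = "http://127.0.0.1:2113"           # assumed single-node address
STREAM = "audit-log"                       # placeholder stream name
AUTH = ("admin", "changeit")               # default credentials
ARCHIVE_INTERVAL_SECONDS = 3600            # how often the subscriber runs

def archive_new_events(stream):
    """Hypothetical helper: copy any not-yet-archived events to cold storage
    and return the highest event number that is now safely archived."""
    ...

def set_truncate_before(stream, event_number):
    """Mark everything before event_number as eligible for scavenging ($tb)."""
    body = [{
        "eventId": str(uuid.uuid4()),
        "eventType": "$metadata",
        "data": {"$tb": event_number},
    }]
    resp = requests.post(
        f"{ES_URL}/streams/{stream}/metadata",
        data=json.dumps(body),
        headers={"Content-Type": "application/vnd.eventstore.events+json"},
        auth=AUTH,
    )
    resp.raise_for_status()

while True:
    last_archived = archive_new_events(STREAM)
    if last_archived is not None:
        # Only after the archive copy is durable do we allow scavenging:
        # everything strictly before last_archived + 1 becomes eligible.
        set_truncate_before(STREAM, last_archived + 1)
    time.sleep(ARCHIVE_INTERVAL_SECONDS)
```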

This is the first time I have looked at stream metadata - the documentation for cacheControl is a little confusing.

If I have a stream of snapshots, in which the end of the stream changes only when a new snapshot is taken, would it make sense to set cacheControl here? The docs say “The head of a feed in the atom api is not cacheable”, which I take to mean that each time I ask for a snapshot it’s not cached at all?

cacheControl basically allows you to make the head of the stream cacheable for n seconds. By default it is not cacheable, but if you have, say, a reverse proxy in front of you and lots of clients, you might like to set it to 5 seconds (which introduces up to 5 seconds of latency for clients).
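So for a snapshot stream that only changes when a new snapshot is written, setting $cacheControl in the stream metadata would let a reverse proxy cache the head page briefly. A minimal sketch under the same assumed endpoint and defaults as above, using the 5-second figure from the reply:

```python
import json
import uuid

import requests  # third-party HTTP client

ES_URL = "http://127.0.0.1:2113"           # assumed single-node address
STREAM = "snapshots-inventory-item-42"     # placeholder snapshot stream

body = [{
    "eventId": str(uuid.uuid4()),
    "eventType": "$metadata",
    # $cacheControl is given in seconds: the head page becomes cacheable for
    # 5 seconds, so clients may see a new snapshot up to 5 seconds late.
    "data": {"$cacheControl": 5},
}]

requests.post(
    f"{ES_URL}/streams/{STREAM}/metadata",
    data=json.dumps(body),
    headers={"Content-Type": "application/vnd.eventstore.events+json"},
    auth=("admin", "changeit"),            # default credentials
).raise_for_status()
```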