Compressing / making events unavailable

Just curious at this point -

I have a use case where I’ll have … a lot of events. So many that ES consumes ~200 GB every 12 hours.

Are there any strategies / features I could use to “archive” old events I know will rarely be read back? Kind of like how graphite compresses old metric data to conserve disk space.

Right now I’m toying with the idea of using separate storage for a certain type of event, which I know would decrease disk usage significantly - but it makes things much more complicated (distributed transactions).

The archived events won’t be read back (except maybe in rare cases) because a snapshot of the stream exists. I still need the events to rebuild a snapshot (rarely), but I don’t need quick access to them.

This thread in the group might provide some useful pointers: https://groups.google.com/forum/#!topic/event-store/moAS2MOeyxc

At any rate, there isn’t anything built in to Event Store to accomplish what you’re after. The only features of Event Store that break the logical append-only model are deleting streams and setting a $maxAge property to cause old events to become eligible for scavenging. So if you want to “move” old events elsewhere, you’ll have to copy them yourself and then allow them to be scavenged.

You might find a way to partition your data into time-boxed streams; then you could copy whole streams to lower-cost storage and delete them from your primary.
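To make the time-boxed idea concrete, here is a rough sketch (not a tested tool) of that copy-then-delete step. It assumes a node on localhost:2113, the default admin credentials, the classic Atom-style HTTP read API with embed=body, and a placeholder time-boxed stream name; it pages the closed stream forward, dumps the events to a local file as a stand-in for lower-cost storage, and then soft-deletes the stream:

```python
import json

import requests  # third-party HTTP client

ES_URL = "http://127.0.0.1:2113"   # assumed single-node address
STREAM = "audit-2015-06"           # placeholder time-boxed stream name
AUTH = ("admin", "changeit")       # default credentials

def archive_stream(stream, out_path, page_size=100):
    """Copy every event of a stream to a local JSON-lines file."""
    start = 0
    with open(out_path, "w") as out:
        while True:
            resp = requests.get(
                f"{ES_URL}/streams/{stream}/{start}/forward/{page_size}",
                params={"embed": "body"},  # embed event data in the feed
                headers={"Accept": "application/json"},
                auth=AUTH,
            )
            resp.raise_for_status()
            entries = resp.json().get("entries", [])
            if not entries:
                break
            # Pages list newest entries first; write oldest-first instead.
            for entry in reversed(entries):
                out.write(json.dumps(entry) + "\n")
            start += len(entries)
            if len(entries) < page_size:
                break

archive_stream(STREAM, f"{STREAM}.jsonl")

# Once the copy is safely in cold storage, soft-delete the stream so a
# scavenge can reclaim the space (an ES-HardDelete: true header would
# make the delete permanent instead).
requests.delete(f"{ES_URL}/streams/{STREAM}", auth=AUTH).raise_for_status()
```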

There are some other things you can do: $maxCount, and there is $tb,
which is truncate before (i.e. delete everything before a given point).
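For reference, a minimal sketch of writing one of those retention settings as stream metadata over the HTTP API, assuming a node on localhost:2113, the default admin credentials, and the /streams/{stream}/metadata endpoint (the stream name and numbers are placeholders):

```python
import json
import uuid

import requests  # third-party HTTP client

ES_URL = "http://127.0.0.1:2113"   # assumed single-node address
STREAM = "inventory-item-42"       # placeholder stream name

# Keep at most 100,000 events; anything older becomes eligible for scavenging.
body = [{
    "eventId": str(uuid.uuid4()),
    "eventType": "$metadata",
    "data": {"$maxCount": 100000},
}]

resp = requests.post(
    f"{ES_URL}/streams/{STREAM}/metadata",
    data=json.dumps(body),
    headers={"Content-Type": "application/vnd.eventstore.events+json"},
    auth=("admin", "changeit"),    # default credentials; metadata writes need them
)
resp.raise_for_status()
```

Swapping "$maxCount" for "$maxAge" (in seconds) or "$tb" (an event number) works the same way.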

Thanks guys -
I think $maxCount would be best for this situation; the specific type of stream I am looking to archive doesn’t contain critical events… but it will still be hard to just delete them. For now I’ll start using that; maybe if this comes up again I’ll look at doing some sort of PR for archiving.

Normally hot/cold storage is implemented in different places on different mediums. If you want to ensure it’s done perfectly, $tb is a better option (a subscriber that moves the data sets it at intervals). If you only use $maxCount, it’s possible a writer could quickly push a bunch of stuff before the subscriber has archived it.
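The shape of that approach might look something like the sketch below, with the same assumed endpoint and credentials as above and a hypothetical archive_new_events helper standing in for whatever actually copies events to cold storage. The point is that $tb only moves forward after the copy is known to be durable, so nothing can be scavenged before it has been archived:

```python
import json
import time
import uuid

import requests  # third-party HTTP client

ES_URL = "http://127.0.0.1:2113"           # assumed single-node address
STREAM = "audit-log"                       # placeholder stream name
AUTH = ("admin", "changeit")               # default credentials
ARCHIVE_INTERVAL_SECONDS = 3600            # how often the subscriber runs

def archive_new_events(stream):
    """Hypothetical helper: copy any not-yet-archived events to cold storage
    and return the highest event number that is now safely archived."""
    ...

def set_truncate_before(stream, event_number):
    """Mark everything before event_number as eligible for scavenging ($tb)."""
    body = [{
        "eventId": str(uuid.uuid4()),
        "eventType": "$metadata",
        "data": {"$tb": event_number},
    }]
    resp = requests.post(
        f"{ES_URL}/streams/{stream}/metadata",
        data=json.dumps(body),
        headers={"Content-Type": "application/vnd.eventstore.events+json"},
        auth=AUTH,
    )
    resp.raise_for_status()

while True:
    last_archived = archive_new_events(STREAM)
    if last_archived is not None:
        # Only after the archive copy is durable do we allow scavenging:
        # everything strictly before last_archived + 1 becomes eligible.
        set_truncate_before(STREAM, last_archived + 1)
    time.sleep(ARCHIVE_INTERVAL_SECONDS)
```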

This is the first time I have looked at stream metadata - the documentation for cacheControl is a little confusing.

If I have a stream of snapshots, in which the end of the stream changes only when a new snapshot is taken, would it make sense to set cacheControl here? The docs say “The head of a feed in the atom api is not cacheable”, which I take to mean that each time I ask for a snapshot it’s not cached at all?

cacheControl basically allows you to make the head of the stream cacheable for n seconds. By default it is not cacheable, but if you have, say, a reverse proxy in front of you and lots of clients, you might like to set it to 5 seconds (which introduces up to 5 seconds of latency for clients).
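So for a snapshot stream that only changes when a new snapshot is written, setting $cacheControl in the stream metadata would let a reverse proxy cache the head page briefly. A minimal sketch under the same assumed endpoint and defaults as above, using the 5-second figure from the reply:

```python
import json
import uuid

import requests  # third-party HTTP client

ES_URL = "http://127.0.0.1:2113"           # assumed single-node address
STREAM = "snapshots-inventory-item-42"     # placeholder snapshot stream

body = [{
    "eventId": str(uuid.uuid4()),
    "eventType": "$metadata",
    # $cacheControl is given in seconds: the head page becomes cacheable for
    # 5 seconds, so clients may see a new snapshot up to 5 seconds late.
    "data": {"$cacheControl": 5},
}]

requests.post(
    f"{ES_URL}/streams/{STREAM}/metadata",
    data=json.dumps(body),
    headers={"Content-Type": "application/vnd.eventstore.events+json"},
    auth=("admin", "changeit"),            # default credentials
).raise_for_status()
```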