Help needed in reducing storage footprint of Event Store

As Event Store operates, the storage footprint will continue to grow forever - is that right? If so, what is the recommended approach to mitigate the risk of running out of disk space? We would like to either partition storage across multiple disks and/or be able to move event streams to an archive instance of EventStore, with the ultimate goal of not growing storage indefinitely over time.

Can you please advise on the right strategy for this? I can’t find anything that addresses it directly in the documentation. In our case, we have an upper limit on drive size.

Thank you

Doug

Scavenging can help. You should also try to avoid ‘forever streams’ in your domain if possible. Most aggregates have a natural lifecycle, e.g. an accounting system will have procedures to close the day, the month, the quarter, and the fiscal year. Use this as an opportunity to archive these streams to ‘cold’ storage - tape drive, s3, etc.

There are a few different ways; it depends on your data. It’s more of a conceptual understanding/modelling decision in most cases.

Can you say more about how your data works?

Is it streams that continue growing?

Is it lots of new streams being created and essentially closed over shorter periods? Packages at UPS would be a good example of this. The packages don’t live long but there are lots of them!

Is there data that is not interesting over time?

Is there a natural time boundary in the domain? E.g. with accounting, things roll over on years for the most part. There is a year open/year close (sometimes even period-based, which is smaller).

There are lots of methods here but they tend to have interaction with your domain itself. What data is “hot” in the domain at a given time? Are there explicit points where data becomes “cold”?

Also, it’s quite common to find that only a few use cases make up the vast portion of your data (say 5 use cases or 2 aggregate types equate to 80% of data). Is this the case in your domain, or is the distribution fairly flat across many events/aggregates?

There isn’t really a right “textbook” answer here; it’s more model-based.

Cheers,

Greg

Thanks for your quick replies.

In our case we have an application where each stream tracks a “project” from initiation to close out, and all data about that project. We can see hundreds of new projects/streams each day, and over the course of each project’s 4 to 6 month life-cycle it can accumulate tens of thousands of events (sometimes more). We have other aggregate types, but the core project streams are far and away the highest data user in our system. After closing out, we would prefer to retain the event log of each project, but do not need to keep a “hot” writable stream.

I am struggling, though, with the actual mechanics of how we might archive this “cold” data in a way that lets us remove it from one disk and partition it across a set of new disks.

Based on your feedback, should we build something along the lines of:

  1. Each year we provision a new EventStore instance for this application (“2019 Project Archive”)

  2. Run a nightly bulk job that looks for closed projects (closed for at least a reasonable grace period)

  3. Copy each stream to the Archive instance

  4. Issue a Tombstone/Hard-Delete event on each archived stream

  5. Execute a scavenge, which will release the storage used by tombstoned streams

This means any hypothetical new subscriber running a catch-up subscription, or any replay requirement, might need to work through the archived projects if it needs to include closed ones. I’m also assuming (as common sense tells me) that a “soft-delete” + scavenge wouldn’t be able to reclaim disk space, since the events are still retained, and that we’d need some strategy for the inevitable request to resurrect a project in order to add details or new events (such as allowing edits in the Archive EventStore or building a reverse un-archive process).
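Concretely, I imagine the nightly job looking something like this rough Python sketch against the HTTP API on port 2113 (host names, credentials, and the source of the closed-project list are placeholders, and a real job would batch the archive writes):

```python
# Rough sketch of the nightly archive job (steps 2-5) against EventStore's HTTP
# API on port 2113. Host names, credentials, and the source of the closed-project
# list are placeholders; a real job would also batch the archive writes.
import json
import uuid
import requests

SOURCE = "http://es-live:2113"            # hypothetical live cluster
ARCHIVE = "http://es-archive-2019:2113"   # hypothetical "2019 Project Archive" instance
ADMIN = ("admin", "changeit")

def read_stream(base, stream, page_size=500):
    """Read all events of a stream, oldest first, from the Atom feed."""
    start, events = 0, []
    while True:
        resp = requests.get(
            f"{base}/streams/{stream}/{start}/forward/{page_size}",
            params={"embed": "body"},
            headers={"Accept": "application/vnd.eventstore.atom+json"},
            auth=ADMIN,
        )
        resp.raise_for_status()
        entries = resp.json().get("entries", [])
        # The feed lists entries newest-first within a page, so reverse them.
        events.extend(reversed(entries))
        if len(entries) < page_size:
            return events
        start += page_size

def copy_stream(stream):
    """Step 3: copy every event into the archive instance."""
    body = [
        {
            # Fresh ids here; reusing the source event ids instead would make
            # retries of this job idempotent.
            "eventId": str(uuid.uuid4()),
            "eventType": e["eventType"],
            "data": json.loads(e["data"]),  # assumes JSON event bodies
            "metadata": json.loads(e.get("metaData") or "{}"),
        }
        for e in read_stream(SOURCE, stream)
    ]
    requests.post(
        f"{ARCHIVE}/streams/{stream}",
        data=json.dumps(body),
        headers={"Content-Type": "application/vnd.eventstore.events+json"},
        auth=ADMIN,
    ).raise_for_status()

def hard_delete(stream):
    """Step 4: tombstone the source stream so scavenge can reclaim its space."""
    requests.delete(
        f"{SOURCE}/streams/{stream}",
        headers={"ES-HardDelete": "true"},
        auth=ADMIN,
    ).raise_for_status()

def nightly_job(closed_project_streams):
    # Step 2 (finding projects closed past the grace period) is domain-specific
    # and assumed to come from a read model.
    for stream in closed_project_streams:
        copy_stream(stream)
        hard_delete(stream)
    # Step 5: kick off a scavenge on the live node.
    requests.post(f"{SOURCE}/admin/scavenge", auth=ADMIN).raise_for_status()
```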

Is that how you would go about dealing with this cold data in EventStore?

Thank you again,

Doug

That would be a reasonable approach.

Rather than tombstoning archived streams, it may be better to write a redirection record to the stream and set the stream’s maximum length ($maxCount) to 1.

The scavenge will then pull most of the data out, and the running application will know where the stream went. The redirection record can also keep simple data such as a display name, archive date, and a link, to allow the running application to display or find the record if needed.
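A minimal sketch of that, again via the HTTP API (the ProjectArchived event type and its fields are just illustrative names):

```python
# Sketch of the redirection-record variant. The ProjectArchived event shape and
# the archive URL are illustrative assumptions, not a prescribed format.
import json
import uuid
from datetime import date
import requests

SOURCE = "http://es-live:2113"
ADMIN = ("admin", "changeit")

def archive_with_redirect(stream, display_name, archive_url):
    # 1. Append a small redirection record saying where the stream went.
    redirect = [{
        "eventId": str(uuid.uuid4()),
        "eventType": "ProjectArchived",
        "data": {
            "displayName": display_name,
            "archivedOn": date.today().isoformat(),
            "archiveLink": f"{archive_url}/streams/{stream}",
        },
    }]
    requests.post(
        f"{SOURCE}/streams/{stream}",
        data=json.dumps(redirect),
        headers={"Content-Type": "application/vnd.eventstore.events+json"},
        auth=ADMIN,
    ).raise_for_status()

    # 2. Set $maxCount to 1 so only the redirection record survives the next scavenge.
    requests.post(
        f"{SOURCE}/streams/{stream}/metadata",
        data=json.dumps({"$maxCount": 1}),
        headers={"Content-Type": "application/json", "ES-EventId": str(uuid.uuid4())},
        auth=ADMIN,
    ).raise_for_status()
```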

The secondary event store(s) can also use lower-tier storage, and/or in some cases be left cold, with only a single node spun up to read the data on demand - e.g. previous years’ records that are truly read-only.

Another way to reduce the data in the Event Store is to apply some of the well-known principles of 3rd and 1st normal form from classical DB theory. When writing to streams, take the principles of 3NF to heart: data should only be written in one place and referenced by surrogate key links if needed (think Guid identifiers pointing to related streams or aggregates - closer to a pointer reference than a foreign key in practice, but nearly identical in function; n.b. any FK constraints would be enforced in the aggregate code rather than the DB).
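A quick illustration, with made-up event and field names:

```python
# Denormalized: repeats supplier data that already lives in the supplier's own stream.
material_ordered_fat = {
    "eventType": "MaterialOrdered",
    "data": {
        "projectId": "6f9e07c2-...",
        "sku": "CEM-40KG",
        "quantity": 120,
        "supplierName": "Acme Building Supply",   # duplicated
        "supplierAddress": "12 Quarry Rd",        # duplicated
        "supplierRating": 4.7,                    # duplicated
    },
}

# Normalized: only the surrogate key; the supplier stream owns its own data.
# The "FK" is enforced by the ordering aggregate, not by the store.
material_ordered_lean = {
    "eventType": "MaterialOrdered",
    "data": {
        "projectId": "6f9e07c2-...",
        "sku": "CEM-40KG",
        "quantity": 120,
        "supplierId": "3b7c9a52-...",  # pointer-style reference to the supplier stream
    },
}
```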

Then, when dealing with read models, think of them as following a 1NF style. They are the materialized views, cache objects, or reporting tables produced for the system to read. Here they pull all of the data together into a targeted view for the application to process. This allows the application not to worry about the equivalent of complex joins; instead, that is handled by the event processors that build the views/read models.
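And a minimal sketch of a read model builder doing that work, again with made-up names:

```python
# In-memory stand-ins for a document store / reporting table; in practice these
# would be whatever your read side uses.
projects_view = {}      # keyed by projectId
supplier_names = {}     # small lookup maintained from supplier stream events

def handle(event_type, data):
    """Event processor that denormalizes into the view, so the app never joins."""
    if event_type == "SupplierRegistered":
        supplier_names[data["supplierId"]] = data["name"]
    elif event_type == "MaterialOrdered":
        row = projects_view.setdefault(data["projectId"], {"orders": []})
        row["orders"].append({
            "sku": data["sku"],
            "quantity": data["quantity"],
            # the "join" happens here, in the projection, not in the event
            "supplierName": supplier_names.get(data["supplierId"], "unknown"),
        })
```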

Quite often when I see large data sets in streams it is due to the instinctive reaction to put data into the events for the sake of the view, or to repeat unchanged data in each update written to the stream. Pulling data together for the view should be solely the job of the read model builders, and they have the benefit of the current state right there in the target they are updating.

One case where events will have repeated data is when they are public events to be published externally. These are just a different type of read model and should follow the 1NF style.

Chris