Archiving, snapshots and backups

Hi everybody,

Our devs decided to use Event Store. I am, however, still super fuzzy on implementing backup, recovery and audits when using the Event Store. I haven’t been able to define a process that is easy and fast, requires as little code as possible (preferably none), and uses no projections. At the moment, projections are not stable enough in our scenarios.

We have an application that, once it is in full production, will require about 250 GB of space and 25 million events per month (for the event store). For legal reasons, we need to store 7 years of data (21 TB and 2.1 billion events, and that is at a fixed scope; most likely the scope will grow instead, but it does not really matter at this point). Because of this, we need to be able to provide, in a reasonable amount of time, any period requested by an auditor over the past 7 years.
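As a rough sanity check, here is the arithmetic behind those figures (a quick sketch, assuming the scope really does stay fixed):

```python
# Back-of-the-envelope capacity check, assuming a fixed scope of
# 250 GB and 25 million events per month over the 7-year retention window.
MONTHS = 7 * 12                          # 84 months

gb_per_month = 250
events_per_month = 25_000_000

total_tb = gb_per_month * MONTHS / 1000  # ~21 TB
total_events = events_per_month * MONTHS # ~2.1 billion events

print(f"{total_tb:.0f} TB, {total_events / 1e9:.1f} billion events over {MONTHS} months")
```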

There are also, at this time, 2 types of data in the event store:

  • permanent data => we would want to keep the last M events per stream,

  • live data => we would want to keep the past N months in the event store, as well as at least the last event. I.e., if the last event is 3 months old, it is still in the event store.

The reason we would want to keep the last event is so that every point in time has all the streams plus a certain amount of history from the past. This would avoid having to replay millions and millions of events since the beginning of time. An old stream could also actually be re-activated.
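To make the rule concrete, here is a small sketch of the selection logic (plain Python, not Event Store scavenging; the M/N knobs and the event shape are illustrative assumptions):

```python
from datetime import datetime, timedelta, timezone

# Illustrative sketch of the retention rule described above. An "event" here
# is any object with a .created timestamp; m, n_months and the stream kind
# are the knobs from the post, with made-up defaults.

def events_to_keep(events, kind, m=100, n_months=2, now=None):
    """Return the events we'd want to retain for one stream.

    kind == "permanent": keep the last M events.
    kind == "live":      keep everything from the past N months, and always
                         keep at least the most recent event, even if it is
                         older than N months (so the stream never vanishes).
    """
    now = now or datetime.now(timezone.utc)
    events = sorted(events, key=lambda e: e.created)

    if kind == "permanent":
        return events[-m:]

    cutoff = now - timedelta(days=30 * n_months)   # rough month approximation
    recent = [e for e in events if e.created >= cutoff]
    return recent if recent else events[-1:]       # fall back to the last event
```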

At the moment, from reading around, the way to load an audit period would be to load the backup from the beginning of the period and replay it into an ES, then load period+1 and read it into the ES, load period+2 and read it into the ES, etc., until we have the whole requested period. All of that would require a lot of manual work, as well as custom code to replay from the loading ES into the ES containing the whole audit period.
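Something like the following sketch is what I have in mind for that replay step (the `read_all_events` / `append_event` callables are hypothetical stand-ins, not a specific Event Store client API):

```python
# Rough sketch of "replay one restored backup period into the audit store".
# read_all_events is assumed to yield events from the restored node in order;
# append_event is assumed to write one event into the audit store.

def replay_period(read_all_events, append_event, batch_size=500):
    """Copy every event from a restored backup node into the audit store."""
    copied = 0
    for event in read_all_events(batch_size=batch_size):
        append_event(
            stream=event.stream,
            event_type=event.type,
            data=event.data,
            metadata=event.metadata,
        )
        copied += 1
    return copied
```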

For the backups, they would run nightly, be dumped to tape, and use 84 tapes on a rotation.

Does anyone have a better idea?

And also, how would you go about doing the scavenging with the conditions above?

Finally, are there any advantages to taking snapshots in this scenario?

Thanks for all and any insights,

Olivier

Note: recovery-wise, we have a 3-node cluster, so we should be okay in this regard.

The problem is here:

250 GB of space per month and 25 million events per month

Why?

What data is in your "events"?

Claims data with payment information is in the events. Claims data ranges from 1 kB to 20 kB.

Just running a quick calc over your data, you have:

25m/month * 5 kB (assuming a reasonable distribution) = 125 GB/month, not 250 GB, but let's assume it is.

"At the moment from reading around, the way to load an audit period
would be to load the backup of the beginning of the period, replay it
into an ES. Then load period+1 read it into the ES, load period+2 read
it into the ES, etc... until we get the whole period requested. All of
that would require a lot of manual work as well as custom code to
replay from the loading ES into the ES containing the whole audit
period.
For the backups, they would be running nightly, dumped to tape and
have 84 tapes on a rotation."

Why are there no snapshots here?

You realize backups are incremental? Perhaps we should improve the docs on this.

Why would the backups be incremental? I mean, if I take a backup every day and the data in the ES covers the past 2 months, then I have a 2-months-minus-1-day overlap in the data, plus the last event of each stream. Which means that a backup every 2 months would be an increment to the backup from 2 months ago.

Also, to me, incremental means incremental as in an incremental backup of a SQL Server, which means you need the latest full backup plus all increments, in order, from then until now, to get to now.

It ain’t the only data in the event store :slight_smile: And between 125 GB and 250 GB, there ain’t much of a difference, at least process-wise :slight_smile:

"Why would the backups be incremental ? I mean if I take a backup
every day, and the data in the ES is for the past 2 months."

Because we keep a naming convention (it’s append-only), only new chunk files have any conflict. E.g. chunk-3 is always chunk-3 unless it’s the current chunk. Once it passes this point, it’s immutable except for scavenge (and scavenge will just happen later).
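In other words, an incremental file-level backup can lean on that immutability and only copy what is new or changed, roughly like this sketch (the `chunk-*` glob and the directory paths are assumptions for illustration):

```python
import shutil
from pathlib import Path

# Sketch of an incremental file-level backup that relies on completed chunks
# being immutable: only copy chunk files that are missing from the backup or
# whose size differs (e.g. the chunk that was still being written last time).

def backup_chunks(data_dir: str, backup_dir: str) -> list[str]:
    src, dst = Path(data_dir), Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for chunk in sorted(src.glob("chunk-*")):
        target = dst / chunk.name
        if not target.exists() or target.stat().st_size != chunk.stat().st_size:
            shutil.copy2(chunk, target)
            copied.append(chunk.name)
    return copied
```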

That helps.

So, let’s say I back up the chunks of the past 2 days (the ones whose modified date is newer than what is in the backup); I’ll have a huge pool of chunks in my backup that contains everything from inception, as the data in those chunks would not have been scavenged.

The question is, if I have to reload everything, how do I back up the rest: indexes, chk files, etc.? Wouldn’t these need to be aware that a scavenge happened?

chk files only move forward in time (they represent, for instance, the last written position), so they are not a problem.

The index will rebuild upon restore if it is not there (and you will get back a full index).
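So a restore could look roughly like this sketch: put the chunk and .chk files back in place and leave the index out so it rebuilds (directory layout and the `index` folder name are assumptions here; the official backup/restore docs are the authority on the exact procedure):

```python
import shutil
from pathlib import Path

# Restore sketch based on the answer above: copy the chunk and checkpoint
# (.chk) files back into the data directory and leave the index out, so the
# node rebuilds a full index on startup.

def restore(backup_dir: str, data_dir: str) -> None:
    src, dst = Path(backup_dir), Path(data_dir)
    dst.mkdir(parents=True, exist_ok=True)

    for f in src.iterdir():                  # chunks + *.chk checkpoint files
        if f.is_file():
            shutil.copy2(f, dst / f.name)

    index_dir = dst / "index"                # omit/remove the index...
    if index_dir.exists():
        shutil.rmtree(index_dir)             # ...so it is rebuilt on restore
```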