Hi everybody,
Our devs decided to use Event Store. However, I am still very fuzzy on how to implement backup, recovery and audits with it. I can't define a process that is easy and fast, requires as little code as possible (preferably none), and uses no projections. At the moment, projections are not stable enough in our scenarios.
We have an application that, once it is in full production, will require about 250 GB of space and 25 million events per month (for the event store). Because of legal requirements, we need to keep 7 years of data (84 months, so about 21 TB and 2.1 billion events). That assumes a fixed scope; the scope will most likely grow, but that does not really matter at this point. We therefore need to be able to provide, in a reasonable amount of time, any period requested by an auditor within the past 7 years.
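To make the sizing explicit, here is a minimal sketch of the arithmetic behind the 21 TB and 2.1 billion figures. The function name `retention_totals` is just something I made up for illustration, and it assumes a constant monthly volume and 1 TB = 1000 GB.

```python
MONTHS_PER_YEAR = 12

def retention_totals(gb_per_month, events_per_month, years):
    """Return (months, total GB, total events) for a fixed monthly volume."""
    months = years * MONTHS_PER_YEAR
    total_gb = gb_per_month * months
    total_events = events_per_month * months
    return months, total_gb, total_events

months, total_gb, total_events = retention_totals(250, 25_000_000, 7)
# Convert GB to TB with 1 TB = 1000 GB (an assumption, not a measured figure).
print(months, total_gb / 1000, total_events)
```

If the scope grows over time, the monthly figures would have to become a per-month series rather than constants, and the totals would be higher.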
There are also, at this time, 2 types of data in the event store:
- permanent data => we would want to keep M events per stream for these,
- live data => we would want to keep the past N months in the event store, as well as at least the last event. I.e., if the last event is 3 months old, it is still in the event store.
The reason we want to keep the last event is so that every point in time has all the streams plus a certain amount of history. This avoids having to replay millions and millions of events from the beginning of time, and it means an old stream could actually be re-activated.
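To pin down the two retention rules, here is a rough sketch of what I mean, in plain Python and not against any Event Store API. Everything in it is an assumption for illustration: events are `(timestamp, payload)` tuples sorted oldest first, a month is approximated as 30 days, and the names `keep_permanent` and `keep_live` are hypothetical.

```python
from datetime import datetime, timedelta

def keep_permanent(events, m):
    """Permanent data: keep only the newest m events of a stream."""
    return events[-m:] if m > 0 else []

def keep_live(events, now, n_months):
    """Live data: keep the past n_months, and always at least the last event."""
    # Approximate a month as 30 days (an assumption made for simplicity).
    cutoff = now - timedelta(days=30 * n_months)
    recent = [e for e in events if e[0] >= cutoff]
    if not recent and events:
        # Stream has been idle longer than the window: keep its last event.
        recent = [events[-1]]
    return recent

now = datetime(2024, 6, 1)
idle_stream = [(datetime(2024, 1, 1), "a"), (datetime(2024, 2, 1), "b")]
print(keep_live(idle_stream, now, 2))
```

With N = 2, the idle stream above keeps only its last event ("b"), which is the behaviour I am after: the stream still exists at every point in time and can be re-activated.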
At the moment, from reading around, the way to load an audit period would be to load the backup from the beginning of the period and replay it into an ES, then load period+1 and replay it into the ES, then period+2, and so on until we have the whole requested period. All of that would require a lot of manual work, as well as custom code to replay from the ES used for loading into the ES containing the whole audit period.
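As a sketch of the procedure described above, this is roughly how I picture choosing which backups to restore for an audit period. It rests on assumptions that may not hold in practice: a backup exists for every date, each backup contains the previous `window_days` of events (the N-month live window), and the function name `backups_to_restore` is hypothetical.

```python
from datetime import date, timedelta

def backups_to_restore(audit_start, audit_end, window_days):
    """Return the backup dates to restore, oldest first.

    The first backup gives the state at the start of the audit period;
    each later one adds up to window_days of events, until audit_end.
    """
    if audit_end < audit_start:
        raise ValueError("audit_end must not be before audit_start")
    dates = [audit_start]
    current = audit_start
    while current < audit_end:
        # Step forward one retention window, without overshooting the end.
        current = min(current + timedelta(days=window_days), audit_end)
        dates.append(current)
    return dates

for d in backups_to_restore(date(2021, 1, 1), date(2021, 1, 21), 10):
    print(d.isoformat())
```

Each date in the list would still mean one manual restore plus a custom replay into the audit ES, which is exactly the part I would like to avoid.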
The backups would run nightly and be dumped to tape, with 84 tapes on rotation.
Does anyone have a better idea?
Also, how would you go about scavenging under the conditions above?
Finally, are there any advantages to taking snapshots in this scenario?
Thanks for any and all insights,
Olivier
Note: Recovery-wise, we have a 3-node cluster, so we should be okay in this regard.