EventStore backup

Hi,

We are planning to back up the EventStore DB with nightly jobs. I have a few questions:

  1. What would you suggest: back up the data of one node, or of all 3?

  2. Should the ES node be stopped before copying the .chk files?

  3. Is it necessary to copy all remaining files? Do they contain some important data or only configuration info? (Referencing documentation “Copy the remaining files and directories to the backup location.”)

  4. Can an ES node be restored from a backed-up hard drive? For example: our data gets corrupted, we wipe the data from all nodes, and restore 1 (or all) nodes from the hard drive.

Best regards,

Modestas

You only need to back up one (the logs are the same on all of them).

You do not need to stop eventstore

The other files *are the data*

Yes you can restore (just copy the files over)
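
As a rough illustration of the ordering the documentation describes (the .chk files first, then the remaining files and directories), a nightly copy job might look something like the sketch below. The function name `backup_data_dir` and the paths are hypothetical, and the sketch assumes the data directory is an ordinary folder you have read access to; it is not an official tool.

```python
import shutil
from pathlib import Path


def backup_data_dir(data_dir: str, backup_dir: str) -> None:
    """Copy a data directory: checkpoint files first, then the rest.

    Hypothetical helper for illustration only.
    """
    src, dst = Path(data_dir), Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)

    # Step 1: copy the *.chk checkpoint files before anything else.
    for chk in sorted(src.glob("*.chk")):
        if chk.is_file():
            shutil.copy2(chk, dst / chk.name)

    # Step 2: copy the remaining files and directories.
    for entry in sorted(src.iterdir()):
        if entry.is_file() and entry.suffix == ".chk":
            continue  # already copied in step 1
        target = dst / entry.name
        if entry.is_dir():
            shutil.copytree(entry, target, dirs_exist_ok=True)
        else:
            shutil.copy2(entry, target)
```

Called as, say, `backup_data_dir("/var/lib/eventstore", "/mnt/backup/es")`, where both paths are placeholders for wherever your data directory and backup location actually are.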

Just out of interest: why is it possible to run the backup process while ES is running? What happens if someone writes data at that moment?

On Friday, October 7, 2016 at 17:28:22 UTC+3, Greg Young wrote:

Then it still works.

It works because it was designed to work that way :slight_smile:

What a rude and arrogant “non-answer”. The last and first question was probably asked because the documentation is lacking, as stated several times. Also the documentation on restoring is lacking, or rather crappy (see below). Create a copy of chaser.chk? What IS chaser? Should it overwrite truncate.chk (I assume so), since it is already there? What happens if you forget to rename / overwrite truncate?

https://eventstore.org/docs/server/database-backup/#restoring-a-database

Restoring a database

  1. Create a copy of chaser.chk and call it truncate.chk.
  2. Copy all files to the desired location.

Should you not stop the service / server before restoring? What about restoring to a cluster? Should all be stopped? Should data be restored to just one or all servers? Come on!

Maybe: Add to documentation:

  1. Stop your eventstore server (on Linux: sudo service eventstore stop).

  2. If in a cluster, delete all data directories on all of the (e.g. 3) servers (e.g. all data in /var/lib/eventstore)

  3. Restore data on one server

  4. Start that server: sudo service eventstore start

(check that it runs: sudo service eventstore status - which it might not if e.g. you have not restored permissions correctly - important that the owner is also eventstore, not just the group)

  5. Start the other servers’ eventstore services in the cluster

and you should be all good.

/Hoegge

… what IS chaser.chk and what is truncate.chk and the difference?

The .chk files are part of the database; this could be a bit clearer. If you think of the chunk files as representing an append-only log, the .chk files represent positions within that log (almost always increment only!). They are essentially just pointers.

truncate.chk says where a truncate should go to, if one were to occur. In a cluster, think of it as "I know I am OK up to here!", so in weird failures etc. let's truncate to this point and redo from there, not redo everything.

chaser.chk says where the chaser process has committed itself to.

writer.chk says the last place in the log that the writer considers good.

Obviously there are some interactions between these.

With the chaser, there is a process following the log (much like a catch-up all subscription, going through and doing things). chaser.chk just marks where its last known position is.

There has been documentation on them at varying points…

What you seem to want though is step by step instructions for your configuration/situation.

Providing a default is fairly reasonable; however, as examples:

  1. not everyone runs as a “service”, and on Linux people also run under varying forms of management, so a large number of these steps can vary from install to install

  2. the data is not necessarily in those places; those are defaults and it depends on configuration, e.g. it becomes “where you configured your data directory to be”

  3. for large databases you absolutely do not want to do this (it can take a day or more to re-replicate). You want to restore from backup first and then re-replicate; for example, a 1 TB database will take quite some time over the network, which in some cases could mean downtime if, say, a node is lost! I want to restore first, then when the node comes up, catch-up starts from the backup position.

  4. sometimes (most common) you only want to restore a single node not all nodes in a cluster (node A has issue, put in node A1 (possibly restore from backup) … bring up).

  5. incremental backups can also be done (faster but worth discussion) … most of those files are immutable.

  6. tape based backup can be done

  7. often running one or more offsite clones as backups is preferred (near real-time). File copy backups are literally for “we had bombs go off in 2 data centers” type scenarios. Remember you already have N copies of this data running live!

  8. often (sustained traffic etc) you don’t want backups on primary nodes but prefer to run a clone which is then backed up

  9. many run disk/hw level backups instead of software as primary

  10. this is obviously a bit different in the cloud

  11. many environments are backed up at a lower level

There quickly come to be a lot of “buts” and “ifs” involving situational information beyond “this is how you back up”

I believe there is some tooling coming around this as well :wink:

This also seems like a place where a document covering “a few common scenarios” would likely be useful, as opposed to just the “copy these files”.

Backups are actually a good example of where what is likely needed is more of a holistic doc of “so you want to set up like XYZ1 … here is basic layout … here is how deploys work … here is how backups work … here is monitoring ideas, here is how minor/major upgrades work … here is how restores work…”, in other words tying together things in a nice package as opposed to “here are your 27 options”

Hi Greg,

I was wondering about backups too. But it seems that replication is essentially your backup. Is this right? There wouldn’t be much value in doing a nightly backup because of the gap in events and the nature of projections. You just need to make sure you have fault tolerance. I guess in cases of corrupted data, you might want to “roll back”. Not to be overly philosophical, but it seems like you would want that data, too (since it is important for debugging/fixes/etc).

Of course, situations are arbitrary and far more complex than I could imagine, but thinking about it without all the distributed systems + devops magic: if I had an append-only log and I started getting corrupted data in, wouldn’t the basic strategy be to replay from a checkpoint that I knew wasn’t corrupted?

Backups are quite easy.

https://eventstore.org/docs/server/database-backup/

Whether these are done to tapes/disks/etc. is fairly straightforward. The benefit of a cluster is that it is active, and the benefit of, say, clones is that they are faster (e.g. they are constantly running and are already a node with the data, so a possible takeover is much … much … faster; a manual takeover can be done in ± 5 minutes). Using only clones does carry a correlated failure risk, though I have not seen such a failure. As such, you might prefer to have a node doing a full backup even with, say, clones available, even though it would be highly unlikely to ever get used.

For rollbacks from a “known good position” for projections, this only works if you know the state at that position. Otherwise you must start where that snapshot is from. If you are replaying projections often, this is probably something you know how to do well/have strategies in place for handling. Generally I would do a full replay and keep multiple for availability concerns. Read models are relatively easy to make highly available, as you can use N of them with relatively little cost (e.g. I have 3 individual SQL Server nodes as opposed to one clustered SQL Server setup … downtime risk--)

Chaser is what the system considers to be the last known good write.

Truncate instructs the system to roll back (truncate/delete) any writes to that point on startup.

The interaction of these two checkpoint files is how the restore deals with writes after the backup starts.

E.g. any writes after the checkpoint is copied are truncated on startup, as the chaser checkpoint file is converted to a truncate checkpoint on restore.

Yes, the system must be stopped for a restore to be possible. Fundamentally the Event Store is an immutable append-only log, and editing that log, even with a restore operation, is contraindicated. The log must be replaced and the system started fresh.

As the same log is replicated to each node it is much faster to replace the files on disk than to go through the overhead of republishing and confirming the log throughout the cluster. When the nodes come up and form a cluster they will all detect they have the same log and just begin running forward.

This restore should be viewed as a drop and replace or break glass operation on the cluster.
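
To make the drop-and-replace idea concrete, here is a minimal sketch of a single-node restore. It assumes the node is already stopped, that the backup folder contains a chaser.chk, and that overwriting any existing truncate.chk is the intended behaviour (the thread does not confirm that last point). The function name `restore_data_dir` is hypothetical, and fixing file ownership afterwards is left out.

```python
import shutil
from pathlib import Path


def restore_data_dir(backup_dir: str, data_dir: str) -> None:
    """Replace data_dir with the backup. The node must be stopped first.

    Hypothetical helper for illustration only.
    """
    src, dst = Path(backup_dir), Path(data_dir)

    # Drop and replace: remove whatever log is currently on disk.
    if dst.exists():
        shutil.rmtree(dst)
    shutil.copytree(src, dst)

    # The backed-up chaser checkpoint becomes the truncate checkpoint, so
    # anything written after the checkpoint was copied is truncated on startup.
    # Assumption: an existing truncate.chk is simply overwritten.
    shutil.copy2(dst / "chaser.chk", dst / "truncate.chk")
```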

Incremental backups are rather straightforward, as the tfchunk files (data) are immutable.

Copy and overwrite the new chaser pointer file.

Delete and Replace the contents of the Index folder (the index files are not immutable)

Copy any new data files.
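
The three steps above could be sketched roughly as follows, run against an existing full backup. Several things here are assumptions rather than anything stated in the thread: the function names are hypothetical, the index folder is assumed to be called `index`, and, because only "most" of the files are described as immutable, the sketch also re-copies a data file whose size or modification time differs from the backed-up copy. Other checkpoint files are not touched, since only the chaser pointer is mentioned.

```python
import shutil
from pathlib import Path
from typing import List


def _changed(src: Path, dst: Path) -> bool:
    """True if dst is missing or differs from src in size or mtime."""
    if not dst.exists():
        return True
    a, b = src.stat(), dst.stat()
    return a.st_size != b.st_size or a.st_mtime_ns != b.st_mtime_ns


def incremental_backup(data_dir: str, backup_dir: str) -> List[str]:
    """Refresh an existing backup; returns the names of data files copied.

    Hypothetical helper for illustration only.
    """
    src, dst = Path(data_dir), Path(backup_dir)

    # 1. Copy and overwrite the chaser pointer file.
    shutil.copy2(src / "chaser.chk", dst / "chaser.chk")

    # 2. Delete and replace the contents of the index folder
    #    (folder name "index" is an assumption).
    if (dst / "index").exists():
        shutil.rmtree(dst / "index")
    if (src / "index").exists():
        shutil.copytree(src / "index", dst / "index")

    # 3. Copy any new (or, conservatively, changed) data files.
    copied = []
    for f in sorted(src.iterdir()):
        if f.is_file() and f.suffix != ".chk" and _changed(f, dst / f.name):
            shutil.copy2(f, dst / f.name)
            copied.append(f.name)
    return copied
```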

-Chris

Greg just pointed out that while the contents of the index folder change the id named files are not modified after writing.

So for the index folder the most efficient approach is:

delete any files not in the backup source

Copy any missing files from the backup source

Copy & Replace the index map file.
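
A small sketch of that index-folder approach, for illustration only. The function name `sync_index` is hypothetical, the index map file name `indexmap` is an assumption (pass a different `map_name` if yours differs), and the index folder is assumed to be flat, with no subdirectories.

```python
import shutil
from pathlib import Path


def sync_index(source_index: str, target_index: str,
               map_name: str = "indexmap") -> None:
    """Make target_index match source_index without re-copying id-named files.

    Hypothetical helper; `map_name` is an assumed file name.
    """
    src, dst = Path(source_index), Path(target_index)
    dst.mkdir(parents=True, exist_ok=True)
    src_names = {p.name for p in src.iterdir() if p.is_file()}

    # Delete any files not in the source.
    for p in list(dst.iterdir()):
        if p.is_file() and p.name not in src_names:
            p.unlink()

    # Copy any missing files (id-named files are not modified after writing,
    # so files already present are left alone).
    for name in sorted(src_names):
        if name != map_name and not (dst / name).exists():
            shutil.copy2(src / name, dst / name)

    # Copy and replace the index map file.
    if map_name in src_names:
        shutil.copy2(src / map_name, dst / map_name)
```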