Backup best practices and reliability

Hi,
I’ve been reading up a bit on the backup experiences and issues people have had.
From what I understand, to back up a node I’d have to copy all .chk files first and then the remaining files.
Can this process be done while Event Store is running?
What would happen if a new chunk was generated after I copied the .chk files?

From what I’ve read, this could cause problems when restoring and might require some manual fixing, contacting support, etc. I would find this unacceptable, especially since we would deploy to customers all over the world.
How should backups of clusters be done? Should we back up every single node, or would it be enough to back up one node (and just hope that it isn’t corrupted)?
What’s the worst case if Event Store ran on a virtual machine with disk caching? Are such scenarios recoverable (aside from replaying a backup)?

Regards,
Thomas

Can this process be done while Event Store is running?

Yes.

What would happen if a new chunk was generated after I copied the .chk files?
From what I've read, this could cause problems when restoring and
might require some manual fixing and contact of support etc.? I would
find this unacceptable, especially if we would deploy to our customers
all over the world.

This is totally fine given the instructions provided.
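For reference, the order described above (checkpoint files first, then everything else) can be sketched in a few lines. This is a hypothetical helper, not an official tool; the `backup_node` name and the flat directory layout are assumptions:

```python
import shutil
from pathlib import Path

def backup_node(db_dir: str, backup_dir: str) -> None:
    """Point-in-time backup: checkpoint files FIRST, then everything else.

    Copying the .chk files first means the backup's checkpoints can only be
    *behind* the copied chunk data, never ahead of it -- so the worst case
    on restore is an extra chunk, not missing data.
    """
    src, dst = Path(db_dir), Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)

    # 1. Checkpoint files first.
    for chk in src.glob("*.chk"):
        shutil.copy2(chk, dst / chk.name)

    # 2. Then the remaining files (chunks, index, ...).
    for f in src.rglob("*"):
        if f.is_file() and f.suffix != ".chk":
            rel = f.relative_to(src)
            (dst / rel).parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, dst / rel)
```

The ordering is the whole point: reversing the two steps is what produces the corrupt-backup scenario discussed further down in this thread.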

How should backups of clusters be done?
Should we backup every single node or would it be enough to backup one
node (and just hope that it isn't corrupted?)

The nodes all have exactly the same data. Also, if you lose a node, you can bring it back by deleting its db and rejoining the cluster (it will replicate from the other nodes).

"What's the worst case if Event Store ran on a virtual machine with
disk caching? Are such scenarios recoverable (aside from replaying a
backup)?"

I am not 100% sure what the question is here.

Is the order of the chk files important? Because I would think that there is no guarantee in which order they would be copied…

Is Event Store robust enough to start if there are new chunks that are not yet part of the index etc.? I.e. is there any scenario in which copying the chk files in any order (i.e. not at the same time), then the remaining files, would result in any problems?
The background for this question is: we might have dozens, if not hundreds of installations all over the world. It is simply not feasible to assume that we would open a support ticket to get any servers running after an outage - or when restoring data for testing, debugging etc.

Compare this to, say, making a backup of an SQL server. The server itself usually provides an interface for that, and you get one file that you can then restore. Having to deal with multiple files, and not knowing if ES will even start after applying the backup, is kind of a hot issue.

Is the order of the chk files important? Because I would think that
there is no guarantee in which order they would be copied...

Within small time periods, no (e.g. just copy the files). Over large time
frames, yes, it could be an issue. As an example, you could end up with an
index checkpoint that is ahead of your writer checkpoint, which would
result in an index rebuild, as that state doesn't really make sense.
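If you want to sanity-check a backup for exactly this condition, the checkpoint files can be compared directly. This assumes the .chk files hold a little-endian 64-bit log position, which matches how Event Store's file checkpoints are usually described; verify against your version before relying on it:

```python
import struct
from pathlib import Path

def read_checkpoint(path: str) -> int:
    """Read a checkpoint file, assumed to hold a little-endian int64 position."""
    data = Path(path).read_bytes()[:8]
    return struct.unpack("<q", data)[0]

def backup_looks_consistent(backup_dir: str) -> bool:
    """Heuristic check: no other checkpoint should be ahead of the writer
    checkpoint, or a (harmless but slow) index rebuild is likely on start."""
    d = Path(backup_dir)
    writer = read_checkpoint(str(d / "writer.chk"))
    others = [c for c in d.glob("*.chk") if c.name != "writer.chk"]
    return all(read_checkpoint(str(c)) <= writer for c in others)
```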

Is Event Store robust enough to start if there are new chunks that are
not yet part of the index etc.? I.e. is there any scenario in which
copying the chk files in any order (i.e. not at the same time), then
the remaining files, would result in any problems?

With the index, the worst thing that can happen is an index rebuild, which
may take time but doesn't lose any information.

Do you have some examples where it didn't start after a restore?

Okay, thank you.
I do have to ask though, why backup isn’t an integrated feature in the web UI or otherwise accessible. Most products do feature backup options out of the box :)

Because many products require complicated integration to be able to do a
backup. If it's just copying files, is it worth having something more
complicated?

Well, it is copying files in a specific order and restoring requires replacing a file with another file.
Plus, everything can still blow up in your face - I can’t get Event Store to run, most likely because a new chunk was added while the backup was running:

```
[PID:14072:001 2016.05.04 14:08:42.618 FATAL ProgramBase`1 ] Unhandled exception while starting application: During truncation of DB excessive TFChunks were found: D:\Temp\eventstore\testdb\chunk-000023.000000.
System.Exception: During truncation of DB excessive TFChunks were found: D:\Temp\eventstore\testdb\chunk-000023.000000.
   at EventStore.Core.TransactionLog.Chunks.TFChunkDbTruncator.TruncateDb(Int64 truncateChk) in c:\projects\eventstore\src\EventStore.Core\TransactionLog\Chunks\TFChunkDbTruncator.cs:line 38
   at EventStore.Core.ClusterVNode..ctor(TFChunkDb db, ClusterVNodeSettings vNodeSettings, IGossipSeedSource gossipSeedSource, InfoController infoController, ISubsystem[] subsystems) in c:\projects\eventstore\src\EventStore.Core\ClusterVNode.cs:line 164
   at EventStore.ClusterNode.Program.Create(ClusterNodeOptions opts) in :line 0
   at EventStore.Core.ProgramBase`1.Run(String[] args) in c:\projects\eventstore\src\EventStore.Core\ProgramBase.cs:line 59
[PID:14072:001 2016.05.04 14:08:42.666 ERROR Application ] Exiting with exit code: 1.
Exit reason: During truncation of DB excessive TFChunks were found: D:\Temp\eventstore\testdb\chunk-000023.000000.
[PID:04356:001 2016.05.04 14:09:33.735 FATAL ProgramBase`1 ] Unhandled exception while starting application: During truncation of DB excessive TFChunks were found: D:\Temp\eventstore\testdb\chunk-000023.000000. (same exception and stack trace as above)
[PID:04356:001 2016.05.04 14:09:33.753 ERROR Application ] Exiting with exit code: 1.
Exit reason: During truncation of DB excessive TFChunks were found: D:\Temp\eventstore\testdb\chunk-000023.000000.
```

Am I right in assuming that this means I should delete chunk-000023.000000? P.S.: If I do that, it seems to start, but I cannot log in to the web interface, nor can I submit events.

Without knowing your backup process or what has happened, it's hard to say.

Normally, when you have an extra chunk, it should just be deleted. We
should add this to the error message. We could do it automatically, but
it's only during a restore that this is the right thing, so we probably
should not do it automatically.

You mention you can't log into the web interface or submit events. Submit
events through what, TCP or HTTP? Logs would be helpful.

I copied all *.chk files, then I copied all remaining files a few seconds later (I selected the files manually, but a similar behaviour is to be expected when running a batch script or similar process).
I was doing this while continuously appending events because I wanted to simulate a live backup scenario. I saw that the new chunk was added after I had copied the *.chk files and thought that this would be a good scenario to test.
I argue that automatically removing the chunk should not be done.
The web interface just doesn’t seem to do anything, it doesn’t let me log in or give me any error messages whatsoever.
I’m using the .NET API, which seems to hang in AppendToStreamAsync - I stopped the client before I would have gotten any exception.
It is only after connecting with the TCP client that I see some messages about the index rebuild in the console output and log. There was no such information before, or maybe I was just too impatient.
This tells me that we’ll have to document the backup and restore process and possibly automate it as far as possible because it isn’t done by just copying some files.

"Normally when you have an extra chunk it should just be deleted. We
should add this to the error message, we could do it automatically but
its only during a restore that this is the case so we probably should
not do automatically."

"I argue that automatically removing the chunk should not be done."

You argue agreeing with me? I am confused.

Logs?

I do agree with you; I merely wanted to phrase it as adding my own opinion…

I’ve attached the logs. I killed the server at 14:02, around line 8526 in the log - after having made the backup.
I renamed the folder I used for the db, created a new one with the previous name and copied the backup there, overwriting chaser.chk with truncate.chk.
Then I tried to restart ES from a shortcut and it failed. I then tried to start it from a command line, but that failed due to permissions (it was not an elevated command prompt).
I removed the chunk and finally started ES from an elevated command prompt.
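For what it's worth, those restore steps could be scripted along these lines. This is a hypothetical sketch, not an official tool; note that the direction of the checkpoint step below follows Event Store's documented restore procedure (truncate.chk becomes a copy of chaser.chk), so double-check it against the docs for your version:

```python
import shutil
from pathlib import Path

def restore_node(backup_dir: str, db_dir: str) -> None:
    """Restore a backup into a fresh (empty or nonexistent) db directory."""
    src, dst = Path(backup_dir), Path(db_dir)
    if dst.exists() and any(dst.iterdir()):
        raise RuntimeError("restore target must be an empty directory")
    shutil.copytree(src, dst, dirs_exist_ok=True)

    # Force truncation to the last chased (verified) position on startup:
    # truncate.chk becomes a copy of chaser.chk.
    shutil.copy2(dst / "chaser.chk", dst / "truncate.chk")
```

On startup after such a restore, the truncator may still complain about an excessive chunk (as in the log above), which then has to be deleted manually.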

127.0.0.1-2113-cluster-node.log (2.33 MB)

127.0.0.1-2113-cluster-node-err.log (2.92 KB)

The pause you mention shows up here (you probably missed it, as there is
connection information after it):

```
[PID:09816:016 2016.05.04 14:38:30.356 INFO  HttpAsyncServer ] Starting HTTP server on [http://127.0.0.1:2113/,http://localhost:2113/]...
[PID:09816:016 2016.05.04 14:38:30.356 INFO  HttpAsyncServer ] HTTP server is up and listening on [http://127.0.0.1:2113/,http://localhost:2113/]
[PID:09816:017 2016.05.04 14:38:30.371 TRACE PTable          ] Loading and Verification of PTable 'f010d2d6-8fbc-4b70-9fe5-cd943d622996' started...
...
[PID:09816:027 2016.05.04 14:39:33.969 DEBUG PersistentSubscripti] Lost connection from 127.0.0.1:13997
[PID:09816:017 2016.05.04 14:41:18.000 TRACE PTable          ] Loading PTable 'f010d2d6-8fbc-4b70-9fe5-cd943d622996' (8000000 entries, cache depth 16) done in 00:02:47.6206039.
```

What kind of system is this running on, that this operation is taking
almost 3 minutes? I would expect it to take more like 5 seconds (the file
isn't that big at 8m entries). I also see that verifying a chunk file is
pretty fast (a few seconds), and this file at 8m records shouldn't be
much larger. Does your restart generally take this long to load an
index file?

"Then I tried to restart ES from a shortcut and it failed. I then
tried to start it from a command line, but that failed due to
permissions (it was not an elevated command prompt)."

You can also reserve the URL with `netsh http add urlacl` instead of running as admin.

I’m running this on my dev-machine, Windows 7 Ultimate, i7-3770k, 16GB RAM. The ES database is on a 7200rpm HDD.
Right now, it’s rebuilding the index again after I gracefully shut it down last week via the web interface. I can’t seem to log in until that is completed - and it is taking quite a lot of time to do so.
I did not change anything about the files after the last index rebuild.

We’ll have a serious problem if this scenario occurs in production.

127.0.0.1-2113-cluster-node.log (24.7 KB)

P.S.: I couldn’t get it working with the ACL, which is why I’m currently using a shortcut that runs ES as admin.

Hi Urbanhusky,

This may or may not help in regards to running as something other than admin. The account needs to be allowed to bind/open ports, otherwise ES won’t start because it can’t bind the HTTP port properly. The error usually shows in the log. This is especially true if you run it from the Task Scheduler or the like.

As for the rest, I am not part of ES, so Greg/Pieter/James will correct me if needed :)

As for the backup, if I am not mistaken, from a conversation I had in the past, it might be better to:

  • copy the .chk files

  • copy the chunks

The problem in this case is that you could ‘lose’ all data that comes after the .chk files have been saved. However, if you are running clusters, the likelihood of losing all 3 nodes is low, and thus when you restore a node to the cluster it would sync the missing info.

I do also agree that the rebuild is an issue for us too. The DBs are going to keep growing fatter and fatter, so if a rebuild happens we are screwed for a long time :)

Additionally, depending on what you use ES for, you might want to back up only the ‘newer’ chunks, so that your backup contains all data from inception while ES itself only contains the most current data.

Regards,

"The problem in this case is that you could 'lose' all data that come
after the .chk files have been saved. However, if you are running
clusters, the likelihood of losing all 3 nodes is low and thus when
you'd restore a node to the cluster it would sync the missing info.
I do also agree that the rebuild is an issue for us too. The DBs are
gonna be growing fatter and fatter that if a rebuild happens we are
screwed for a long time :)"

When you do a backup, you back up to the point in time at which you
started the backup. This does not seem terribly confusing. What you seem
to want is for a backup to keep backing up while it's backing up.

It is more a question of point in time and potentially missing data. I agree with your remark that in my scenario the backup represents the time at which you copied the .chk files. And I am perfectly fine with that.

However in the other process where you copy the chunks first and then the chk files you can actually end up with a bad backup:

  • Start copy chunks

  • New chunk is created (thus not part of the copy process, unless your script takes notice of the new chunk, and even then you can end up with missing data because of the flushes)

  • End copy chunks

  • Copy chk files

Restore:

  • The .chk files point to a future position that is not in the chunks => issue => which I guess would mean a complete rebuild / replay up to the last item in the last chunk?

Personally, I would go for the one I described, because it seems more consistent to me and I accept the loss of the data that goes after the time I saved the chk files. And since we run clusters we even limit that more.

* Start copy chunks
* New chunk is created (thus not part of the copy process except if
you script so it takes notice of the new chunk and even then you can
end up with missing data because of the flushes)
* End copy chunks
* Copy chk files

"Personally, I would go for the one I described, because it seems more
consistent to me"

This will give you a corrupt backup, unless you are on a slow-moving
system and get lucky, or you introduce explicit locking of the DB.

Start copying chunks
End copying chunks (data at 777777)
Copy checkpoints (checkpoint at 777888)

You just lost data/have a corrupt backup depending on how you view it.

With the other strategy, the worst that can happen is an extra chunk. It's
still a point-in-time backup, just at the point in time you started,
not at the point in time you finished.
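The difference between the two orderings can be shown with a toy simulation (purely illustrative; the positions are made up to echo the numbers above):

```python
# Toy model of a backup taken while the writer keeps appending.
# Chunks first, checkpoints last  => checkpoint can point past the copied
#                                    data (corrupt backup / lost data).
# Checkpoints first, chunks last  => data can extend past the checkpoint
#                                    (harmless extra chunk, truncated on restore).

def snapshot(copy_checkpoint_first: bool) -> tuple[int, int]:
    position = 777_000          # log position when the backup starts

    def advance() -> None:      # the writer appends during the copy
        nonlocal position
        position += 888

    if copy_checkpoint_first:
        checkpoint = position   # .chk copied at the start
        advance()
        data = position         # chunks copied after more writes
    else:
        data = position         # chunks copied at the start
        advance()
        checkpoint = position   # .chk copied after more writes
    return checkpoint, data

chk, data = snapshot(copy_checkpoint_first=False)
assert chk > data               # checkpoint ahead of the data: corrupt backup

chk, data = snapshot(copy_checkpoint_first=True)
assert chk <= data              # data ahead of the checkpoint: safe
```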

It is also important to note that, with some basic rules, you can back up
incrementally: a chunk whose filename is already present in the backup,
and which is not the highest-numbered chunk, does not need to be copied
again (provided you are not scavenging).
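That incremental rule could be sketched like this, assuming completed chunk files are immutable when no scavenge runs and that only the highest-numbered chunk is still being written (the helper name and layout are assumptions):

```python
import shutil
from pathlib import Path

def incremental_chunk_backup(db_dir: str, backup_dir: str) -> list[str]:
    """Copy only chunk files that are new, plus the highest-numbered chunk
    (which may still be actively written). Assumes no scavenging, so a
    completed chunk file never changes once written."""
    src, dst = Path(db_dir), Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)

    chunks = sorted(f.name for f in src.glob("chunk-*"))
    copied = []
    for name in chunks:
        is_highest = name == chunks[-1]
        if is_highest or not (dst / name).exists():
            shutil.copy2(src / name, dst / name)
            copied.append(name)
    return copied
```

A scavenge rewrites chunk files in place under merged names, which is why this only holds while scavenging is off (or until the next full backup).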