We have two instances of the same software running on separate environments. The goal moving forward is to consolidate them into one instance. We are using EventStore 4.1.1 on both.
What is the best way to consolidate them? I found a mention of a node tool that copies events from one server to another, but I couldn’t find more information about it.
Sorry for the delay, I was collecting info from a few folks relating to v4.1.1 and putting some info together for you.
Upgrading from a version that old is going to be a multi-phase process and will involve some downtime, as you’ll need to bring the cluster down and upgrade all the nodes at once. You’ll want to look at upgrading to version 23.10, as that’s our latest long-term supported release and the current recommended version for production environments.
Our docs don’t include instructions for upgrades from v4, as it’s not possible to do a rolling upgrade from v4 to v23. You’ll need to go from 4.1.1 to the latest 5.x and then to 23.10.x due to some changes in the TCP clients, and unfortunately that’ll include some downtime. This blog describes the breaking changes from 4 to 5 and the upgrade procedure to give you an idea of where to start. Also be aware that the TCP client API was deprecated in version 20.2 and removed in version 24.2. The final version with TCP API support is 23.10. More information can be found in this blog post. One more thing to look at in that regard is our migration guide for the gRPC client.
As I was asking around about this, I did get some specific flags from the team that I’m going to pass on directly:
Be aware that we added a max append size, so if you’re doing very large appends you may run into issues
If you use special port and host advertising for public TCP endpoints, those settings no longer exist, and the server will crash on startup unless you set the flag that tells it not to crash on unknown settings
Double check that the stream names don’t conflict on the two instances, except the system stream names. Otherwise you’ll need to decide on a conflict resolution mechanism
The tool you mentioned in your original post might be the replicator tool
TCP and gRPC libs are different so you will have to rewrite some part of your applications for sure with that upgrade
The gRPC client doesn’t automatically retry the way the TCP client does on a not-leader exception, nor when the server takes too long to respond
If you’re on .NET specifically you’ll need to upgrade your client and there are a few breaking changes
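To make the stream name check above a bit more concrete, here is a minimal sketch in Python. It assumes you have already exported the list of stream names from each instance by whatever means suits your setup; the function name and the sample data are hypothetical, not part of any EventStore tooling. It treats names starting with "$" as system streams and leaves them out of the comparison.

```python
def find_stream_conflicts(streams_a, streams_b):
    """Return the user stream names present on both instances, sorted."""
    def user_streams(names):
        # System streams start with "$" (e.g. "$settings") and exist on every
        # instance, so they are left out of the comparison.
        return {name for name in names if not name.startswith("$")}

    return sorted(user_streams(streams_a) & user_streams(streams_b))


# Hypothetical stream lists, e.g. exported from each instance beforehand.
instance_a = ["order-1", "order-2", "customer-7", "$settings"]
instance_b = ["order-2", "invoice-3", "$settings"]

conflicts = find_stream_conflicts(instance_a, instance_b)
print(conflicts)  # prints ['order-2']
```

Any name this reports is one where you would need to decide on a conflict resolution mechanism before merging.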
All that said, I hope this gives you a starting point. I’m also very interested to hear more about your setup as it sounds like you’ve been running 4.1.1 successfully for quite some time. If you’re open to it, I’d love to set up a call and hear more. I can also connect you with our Field Technical Services team, who offer upgrade consulting services, if that’s helpful or interesting (but I would love to learn more about your setup either way.)
Thanks, Alex! We met with your team earlier today to discuss options. Great group of people!
I’m very sorry I saw this thread update late; I missed the notification. Thank you so much for all of the information! We will work more with your team going forward!
I do have one question regarding the upgrade to 5. I realize an upgrade path doesn’t exist directly from 4 → 23.
Regarding 4 → 5, the docs state:
If you’re running an EventStore cluster, we recommend you upgrade to version 5.0.0 by doing a rolling upgrade from version 4.1.1-hotfix1 without downtime:
Stop a node, upgrade it and start it
Does this mean you can stop a node in the 4.1.1 cluster, upgrade it, and bring it back online even though the master is running 4.1.1 and the new node is 5?
You didn’t miss it. I realized while responding to this that we don’t make it super explicit anywhere easily accessible (thank you for asking!). I filed an issue internally to look at making this explicit in our docs, but you’re right: replace the binaries and update the default config.
I feel like a list makes it a little easier to follow. For a rolling upgrade, start with a follower node and end with the Leader:
Stop the node
Upgrade to desired version, in this case by replacing the binaries and updating your default config to match what you had on the previous version*
Start the node
Wait for the node to become a follower (or read-only replica depending on setup)
Repeat on the next node
*Adding just in case this comes up in search for someone down the line, be aware of breaking changes between versions when adjusting this config
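As a rough illustration of the ordering in the list above, here is a small Python sketch. The node names and role labels are made up for the example, and the function is hypothetical; it only works out the order in which to take nodes down, followers (or read-only replicas) first and the Leader last, and does not talk to a real cluster.

```python
def rolling_upgrade_order(nodes):
    """Return node names with every non-leader node first and the leader last."""
    # The sort key is False for non-leaders and True for the leader, and
    # sorted() is stable, so non-leader nodes keep their original order.
    ordered = sorted(nodes.items(), key=lambda item: item[1] == "leader")
    return [name for name, _role in ordered]


# Hypothetical three-node cluster; the role labels are assumptions.
cluster = {"node-1": "leader", "node-2": "follower", "node-3": "follower"}

for name in rolling_upgrade_order(cluster):
    print(f"{name}: stop, replace binaries and config, start, wait until it rejoins")
```

In practice you would re-check which node is the Leader before each step, since an election can move it while nodes go offline.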
Also copying a bit of info to be aware of from our version 23 upgrade doc, as it’s relevant here as well and I’m not sure if you found this in search already:
Upgrading the cluster in this manner keeps the cluster online and able to service requests. There may still be disruptions to your services during the upgrade, namely:
Client connections may be disconnected when nodes go offline, or when elections take place.
The cluster is less fault tolerant while a node is offline for an upgrade because the cluster requires a quorum of nodes to be online to service write requests.
Replicating large amounts of data to a node can have a performance impact on the Leader in the cluster.
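To put numbers on the fault tolerance point, here is a quick Python sketch. It assumes a simple majority quorum (more than half of the configured nodes), and the function names are just for illustration.

```python
def quorum_size(cluster_size):
    """Smallest majority of a cluster, assuming a simple majority quorum."""
    return cluster_size // 2 + 1


def spare_nodes(cluster_size, offline=0):
    """How many more nodes can fail before writes stop being serviced."""
    return cluster_size - offline - quorum_size(cluster_size)


for size in (3, 5):
    print(
        f"{size} nodes: quorum {quorum_size(size)}, "
        f"spare normally {spare_nodes(size)}, "
        f"spare with one node down for upgrade {spare_nodes(size, offline=1)}"
    )
```

Under that assumption, a three-node cluster with one node offline for the upgrade has no spare nodes left: one more failure and writes stop until a node comes back.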
Oh, and projections can get a bit sticky on upgrade and the advice is super dependent on your setup, so if you are using projections make sure you cover that with the folks from the field technical services team.
@jayme.davis : I’m gonna add one piece of advice:
Do test the procedure fully in a test / QA environment,
and do test your application on v5 & v23 before doing the upgrade.