I managed to get it working; it feels a bit all over the place vs v5 tbh.
The end config is currently set to:
RunProjections: All
StartStandardProjections: true
Log: /var/log/eventstore
Db: /eventstoredata
CertificateFile: /etc/eventstore/certs/node.crt
CertificatePrivateKeyFile: /etc/eventstore/certs/node.key
TrustedRootCertificatesPath: /etc/eventstore/certs/ca/
AdvertiseHostToClientAs: platform-10-1-3-196.de.prod.xx.uk
IntIp: 10.1.3.196
ExtIp: 10.1.3.196
IntTcpPort: 1112
ExtTcpPort: 1113
IntTcpHeartbeatTimeout: 2500
IntTcpHeartbeatInterval: 1000
ExtTcpHeartbeatTimeout: 2500
ExtTcpHeartbeatInterval: 1000
GossipTimeoutMs: 2500
GossipIntervalMs: 2000
DiscoverViaDns: true
ClusterDns: platform-es-dn.de.prod.xx.uk
ClusterSize: 3
ScavengeHistoryMaxAge: 15
StatsPeriodSec: 260
WriteStatsToDb: false
EnableExternalTcp: true
EnableAtomPubOverHTTP: true
LogLevel: Default
LogFileRetentionCount: 2
LogFailedAuthenticationAttempts: true
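A couple of sanity checks I ran against that config while chasing this down (the ca.crt filename under the trusted-roots dir is an assumption; substitute whatever is actually in there, and -ext needs OpenSSL 1.1.1+):

```bash
# Cluster DNS record should resolve to every node IP
dig +short A platform-es-dn.de.prod.xx.uk

# Node cert should chain to the trusted root the server is configured with
openssl verify -CAfile /etc/eventstore/certs/ca/ca.crt /etc/eventstore/certs/node.crt

# ...and carry the advertised hostname / node IP in its SANs
openssl x509 -in /etc/eventstore/certs/node.crt -noout -ext subjectAltName
```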
With all the log settings on verbose and the cert generator you use in the blog post, I managed to boil it down to a slight difference in the TLS certs being generated… Changing my certs to match the generator’s output as closely as I could still didn’t work.
I ended up mashing what I needed into a copy of your generator and just using that to make self-signed certs in the end. YOLO; my time ran out. It threw errors with CA-signed certs (which is where this started), and it’s mega fussy about how the certs are generated. I’ll try to wrap this up in a nicely reproducible bow and raise it in issues.
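For reference, this is roughly the shape of node cert I ended up generating; a minimal openssl sketch rather than the actual generator, where the CN, key size and EKU/SAN choices are assumptions based on what the generator spat out, with the hostnames/IPs from my config above:

```bash
# openssl config with the SANs this node needs (CN and EKU values are assumptions)
cat > node.cnf <<'EOF'
[req]
prompt              = no
distinguished_name  = dn
req_extensions      = v3_req
[dn]
CN = eventstoredb-node
[v3_req]
keyUsage         = digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth, clientAuth
subjectAltName   = @alt_names
[alt_names]
DNS.1 = platform-10-1-3-196.de.prod.xx.uk
DNS.2 = platform-es-dn.de.prod.xx.uk
IP.1  = 10.1.3.196
EOF

# Key + CSR for the node, then sign the CSR with the cluster CA
openssl req -new -newkey rsa:2048 -nodes -keyout node.key -out node.csr -config node.cnf
openssl x509 -req -in node.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 365 -extensions v3_req -extfile node.cnf -out node.crt
```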
So, fix in hand, I have a working cluster; yey. As my DNS wiring works fine (it worked for v5 and was never the issue), I figured I’d see how well rolling replacement works.
In v5 we’d spin up new nodes with the EBS data drive restored from the latest snapshot, let those get up to speed (listed as Clone), and then, once in sync, cull the old nodes.
In v21 this doesn’t seem to work any more. If I have a 3-node cluster and spin up 5 or 6 nodes (not that it should matter, as the cluster size is set to three), the new nodes flap around and crash in a loop.
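For what it’s worth, this is how I was watching the nodes while cycling them in: just polling the gossip endpoint on the HTTP port (2113 here) from one of the original nodes, with jq only there for readability:

```bash
# Poll cluster membership as one of the existing nodes sees it
watch -n 5 'curl -sk https://10.1.3.196:2113/gossip | jq ".members[] | {state, isAlive}"'
```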
The default logging level is next to worthless; bar the “TCP is deprecated” warning, your entire node can die and it doesn’t actually tell you anything.
es-log:
{"@t":"2022-11-08T12:21:51.0343376+00:00","@mt":"DEPRECATED\nThe Legacy TCP Client Interface has been deprecated as of version 20.6.0. It is recommended to use gRPC instead.\nAtomPub over HTTP Interface has been deprecated as of version 20.6.0. It is recommended to use gRPC instead\n","@l":"Warning","@i":3819866562,"ProcessId":47602,"ThreadId":1}
Followed by “CLUSTER HAS CHANGED” and “Subscribing at LogPosition:”.
Nothing other than those messages being repeated actually gets logged out.
Sooo back to verbose.
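(Flipping the level is just the LogLevel option from the config above; I edited the YAML and bounced the node. The env-var/flag spellings below follow the usual EventStoreDB option mapping, so treat them as assumptions, and the service name is whatever your install uses.)

```bash
# Bump one node to verbose logging and restart it
sudo sed -i 's/^LogLevel: .*/LogLevel: Verbose/' /etc/eventstore/eventstore.conf
sudo systemctl restart eventstore
# equivalently: EVENTSTORE_LOG_LEVEL=Verbose as an env var, or --log-level=Verbose on the command line
```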
Verbose shows the DNS gossip query worked, the leader was found, the certs read fine, etc.; the InaugurationManager logs PreReplica, then:
[64706,14,12:51:48.906,INF] ========== ["10.1.3.196:2113"] CLONE ASSIGNMENT RECEIVED FROM ["n/a","10.1.3.84:1112/platform-es-dn.de.prod.xx.uk",{44279f24-83b8-43d8-b51d-b77d949d5613}].
[64706,14,12:51:48.906,INF] ========== ["10.1.3.196:2113"] IS CLONE... LEADER IS ["10.1.3.84:2113/platform-es-dn.de.prod.xx.uk",{44279f24-83b8-43d8-b51d-b77d949d5613}]
[64706,14,12:51:48.907,INF] ========== ["10.1.3.196:2113"] DROP SUBSCRIPTION REQUEST RECEIVED FROM ["n/a","10.1.3.84:1112/platform-es-dn.de.prod.xx.uk",{44279f24-83b8-43d8-b51d-b77d949d5613}]. THIS MEA>
[64706,14,12:51:48.908,INF] ========== ["10.1.3.196:2113"] IS SHUTTING DOWN...
[64706, 4,12:51:48.917,DBG] Persistent subscriptions received state change to Clone. Stopping listening
[64706, 4,12:51:48.917,DBG] Persistent Subscriptions have been stopped.
[64706, 4,12:51:48.919,VRB] Connection id ""0HMM1H001NHA9"" received SETTINGS frame for stream ID 0 with length 0 and flags ACK
[64706,14,12:51:48.923,DBG] Closing connection '"leader-secure"""' ["10.1.3.84:1112/platform-es-dn.de.prod.xx.uk", L"10.1.3.196:57794", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}] cleanly." Reason: Node st>
[64706,14,12:51:48.926,INF] ES "TcpConnectionSsl" closed [12:51:48.925: N"10.1.3.84:1112", L"10.1.3.196:57794", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}]:Received bytes: 204, Sent bytes: 350
[64706,14,12:51:48.926,INF] ES "TcpConnectionSsl" closed [12:51:48.926: N"10.1.3.84:1112", L"10.1.3.196:57794", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}]:Send calls: 2, callbacks: 2
[64706,14,12:51:48.926,INF] ES "TcpConnectionSsl" closed [12:51:48.926: N"10.1.3.84:1112", L"10.1.3.196:57794", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}]:Receive calls: 5, callbacks: 4
[64706,14,12:51:48.926,INF] ES "TcpConnectionSsl" closed [12:51:48.926: N"10.1.3.84:1112", L"10.1.3.196:57794", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}]:Close reason: [Success] "Node state changed to ShuttingDown. Closing replication c>
[64706,14,12:51:48.927,INF] Connection '"leader-secure"""' ["10.1.3.84:1112", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}] closed: Success.
THIS MEANS THAT THERE IS A SURPLUS OF NODES IN THE CLUSTER, SHUTTING DOWN.
This seems to be saying that the only way to replace nodes now is manually, with downtime?