DNS-based clustering issues

I’ve added the logs etc. in a gist to avoid flooding the page.

I’m using a custom CA (a paid wildcard cert didn’t work; separate issue). I’ve set the CA as trusted and that all seems fine: I can curl other nodes on 2113/info to check that the connection and trust work.

Read through www.eventstore.com/blog/notes-on-certificate-management-for-eventstoredb-versions-21.10.5-and-beyond

My DNS response returns the internal IPs as it should (and did for v5 prior).
I’ve added the newly required host DNS names.

Setting ClusterGossipPort: 2112 causes connection errors, as ES no longer exposes a port 2112 and the HTTP options for it are gone. I eventually clocked that and commented it out.

Oct 21 15:29:27 ip-10-11-3-237 bash[3737]: [ 3737,22,15:29:27.703,WRN] Failed authorization check for "(anonymous)" in 00:00:00.0000169 with "node/gossip : update Deny : Policy : Legacy 1 12/31/9999 23:59:59 +00:00 : default:denied by default:Deny, $"
Oct 21 15:29:27 ip-10-11-3-237 bash[3737]: [ 3737,22,15:29:27.704,INF] Error status code 'PermissionDenied' raised.
Oct 21 15:29:27 ip-10-11-3-237 bash[3737]: Grpc.Core.RpcException: Status(StatusCode="PermissionDenied", Detail="Access Denied")
Oct 21 15:29:27 ip-10-11-3-237 bash[3737]:    at EventStore.Core.Services.Transport.Grpc.Cluster.Gossip.Update(GossipRequest request, ServerCallContext context) in /home/runner/work/TrainStation/TrainStation/build/oss-eventstore/src/EventStore.Core/Services/Transport/Grpc/Cluster>
Oct 21 15:29:27 ip-10-11-3-237 bash[3737]:    at Grpc.Shared.Server.UnaryServerMethodInvoker`3.AwaitInvoker(Task`1 invokerTask, GrpcActivatorHandle`1 serviceHandle)
Oct 21 15:29:27 ip-10-11-3-237 bash[3737]:    at Grpc.Shared.Server.UnaryServerMethodInvoker`3.AwaitInvoker(Task`1 invokerTask, GrpcActivatorHandle`1 serviceHandle)
Oct 21 15:29:27 ip-10-11-3-237 bash[3737]:    at Grpc.AspNetCore.Server.Internal.CallHandlers.UnaryServerCallHandler`3.HandleCallAsyncCore(HttpContext httpContext, HttpContextServerCallContext serverCallContext)
Oct 21 15:29:27 ip-10-11-3-237 bash[3737]:    at Grpc.AspNetCore.Server.Internal.CallHandlers.ServerCallHandlerBase`3.<HandleCallAsync>g__AwaitHandleCall|8_0(HttpContextServerCallContext serverCallContext, Method`2 method, Task handleCall)

I assume I’m missing something.

I tried to put a boiled-down summary of the logs here, but the forum sees everything, even in markdown, as a link.

Just a hunch (I need to test it out myself) about the second config you have in the gist:

  • the certificate is *.esdns.domain.uk
  • the cluster DNS is esdns.domain.uk
    => *.esdns.domain.uk is not a valid certificate for esdns.domain.uk (a wildcard only matches a single extra label, so it covers cluster.esdns.domain.uk but not the bare domain)

What would it give if you had the cluster DNS entries as, e.g.:
cluster.esdns.domain.uk. 60 IN A 10.11.1.148
cluster.esdns.domain.uk. 60 IN A 10.11.2.20
cluster.esdns.domain.uk. 60 IN A 10.11.3.25

ClusterDns: cluster.esdns.domain.uk instead of ClusterDns: esdns.domain.uk in the config files

Yeah sorry, that’s me ‘cleaning’ the internal names out.

The DNS domain is pretty much as per your example.

Wired up something similar using docker-compose: https://github.com/ChrisMcKee/debug-eventstoredbcluster
The dnsmasq container provides the lookup for dnslookup.eventstore.local on the nodes as expected.
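
For anyone following along, the rough shape of that setup is something like the sketch below (illustrative only, not the repo’s exact files; the image tags, IPs and cert wiring are assumptions, and certificate options are omitted):

version: "3.8"
networks:
  esnet:
    ipam:
      config:
        - subnet: 172.30.0.0/24
services:
  dnsmasq:
    image: jpillora/dnsmasq
    # dnsmasq.conf carries one host-record per node, e.g.
    #   host-record=dnslookup.eventstore.local,172.30.0.11
    volumes:
      - ./dnsmasq.conf:/etc/dnsmasq.conf
    networks:
      esnet:
        ipv4_address: 172.30.0.2
  node1:
    image: eventstore/eventstore:21.10.9-buster-slim
    dns:
      - 172.30.0.2          # resolve the cluster DNS name via dnsmasq
    environment:
      EVENTSTORE_CLUSTER_SIZE: 3
      EVENTSTORE_DISCOVER_VIA_DNS: "true"
      EVENTSTORE_CLUSTER_DNS: dnslookup.eventstore.local
      # certificate options omitted for brevity
    networks:
      esnet:
        ipv4_address: 172.30.0.11
  # node2 / node3 follow the same pattern on .12 / .13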

Try changing ClusterGossipPort in your configuration to 2113, or remove that configuration line entirely.

The distinction between internal and external HTTP interfaces was removed in version 20.10, so all HTTP traffic (including gossip and gRPC) now goes over port 2113 by default.
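
So the relevant lines end up looking something like this (the domain is the placeholder one from this thread):

HttpPort: 2113              # gossip, gRPC and the UI all share this port now
DiscoverViaDns: true
ClusterDns: esdns.domain.uk
ClusterSize: 3
# ClusterGossipPort: 2112   # nothing listens on 2112 any more; remove or leave at default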

Yeah, I clocked that halfway down the gist and moved straight into the next error :sweat_smile:

Are you unblocked now?

Nope, the gist cluster is the same. I’d worked out the gossip issue when I realised setting it to 2112 didn’t open any listening port (may as well remove that config entirely).

I rolled straight into the auth issues at the end of the gist.

I’m off at the mo, so the docker-compose repo was me burning my family’s goodwill to try and reproduce it outside of AWS.

I can’t find anyone mentioning this:

CertificateReservedNodeCommonName: "*.esdns.domain.uk"

@chris.mckee could you privately post your actual DNS entries, public certificate, and config for one of the nodes?

I managed to get it working; it feels a bit all over the place vs v5 tbh.

The end config is currently set to:

RunProjections: All
StartStandardProjections: true

Log: /var/log/eventstore
Db: /eventstoredata

CertificateFile: /etc/eventstore/certs/node.crt
CertificatePrivateKeyFile: /etc/eventstore/certs/node.key
TrustedRootCertificatesPath: /etc/eventstore/certs/ca/

AdvertiseHostToClientAs: platform-10-1-3-196.de.prod.xx.uk

IntIp: 10.1.3.196
ExtIp: 10.1.3.196

IntTcpPort: 1112
ExtTcpPort: 1113

IntTcpHeartbeatTimeout: 2500
IntTcpHeartbeatInterval: 1000
ExtTcpHeartbeatTimeout: 2500
ExtTcpHeartbeatInterval: 1000

GossipTimeoutMs: 2500
GossipIntervalMs: 2000

DiscoverViaDns: true
ClusterDns: platform-es-dn.de.prod.xx.uk
ClusterSize: 3

ScavengeHistoryMaxAge: 15
StatsPeriodSec: 260
WriteStatsToDb: false

EnableExternalTcp: true
EnableAtomPubOverHTTP: true

LogLevel: Default
LogFileRetentionCount: 2
LogFailedAuthenticationAttempts: true

With all the log settings on verbose, and using the cert generator from your blog post, I managed to boil it down to a slight difference in the TLS certs being generated… Changing my setup to match the generator as closely as I could still didn’t work.
I ended up mashing what I needed into a copy of your generator and just using that to make self-signed certs in the end. YOLO; my time ran out. It threw errors using CA-signed certs (which is where this started) and it’s mega fussy about generated certs. I’ll try to wrap this up in a nice reproducible bow and add it to issues.

So, fix in hand, I have a working cluster; yey. Since my DNS wiring works fine (it worked for v5 and was never the issue), I figured I’d see how well rolling replacement works.

In v5 we’d spin up new nodes with an EBS data drive based on the latest snapshot, allow those to get up to speed (listed as clones), and then, once in sync, cull the old nodes.

In v21 this doesn’t seem to work any more. If I have a 3-node cluster and spin up 5 or 6 nodes (not that it should matter, as the cluster size is set to three), the new nodes flap around and crash in a loop.

The default logging level is next to worthless; bar throwing the “TCP is deprecated” warning, your entire node can die and it doesn’t actually tell you anything.

es-log{"@t":"2022-11-08T12:21:51.0343376+00:00","@mt":"DEPRECATED\nThe Legacy TCP Client Interface has been deprecated as of version 20.6.0. It is recommended to use gRPC instead.\nAtomPub over HTTP Interface has been deprecated as of version 20.6.0. It is recommended to use gRPC instead\n","@l":"Warning","@i":3819866562,"ProcessId":47602,"ThreadId":1}

Followed by CLUSTER HAS CHANGED and Subscribing at LogPosition:

Nothing other than those messages being repeated actually gets logged.
Sooo, back to verbose.

Verbose shows the DNS gossip query worked, the leader was found, certs read fine, etc.; InaugurationManager went pre-replica, then:

[64706,14,12:51:48.906,INF] ========== ["10.1.3.196:2113"] CLONE ASSIGNMENT RECEIVED FROM ["n/a","10.1.3.84:1112/platform-es-dn.de.prod.xx.uk",{44279f24-83b8-43d8-b51d-b77d949d5613}].
[64706,14,12:51:48.906,INF] ========== ["10.1.3.196:2113"] IS CLONE... LEADER IS ["10.1.3.84:2113/platform-es-dn.de.prod.xx.uk",{44279f24-83b8-43d8-b51d-b77d949d5613}]
[64706,14,12:51:48.907,INF] ========== ["10.1.3.196:2113"] DROP SUBSCRIPTION REQUEST RECEIVED FROM ["n/a","10.1.3.84:1112/platform-es-dn.de.prod.xx.uk",{44279f24-83b8-43d8-b51d-b77d949d5613}]. THIS MEA>
[64706,14,12:51:48.908,INF] ========== ["10.1.3.196:2113"] IS SHUTTING DOWN...
[64706, 4,12:51:48.917,DBG] Persistent subscriptions received state change to Clone. Stopping listening
[64706, 4,12:51:48.917,DBG] Persistent Subscriptions have been stopped.
[64706, 4,12:51:48.919,VRB] Connection id ""0HMM1H001NHA9"" received SETTINGS frame for stream ID 0 with length 0 and flags ACK
[64706,14,12:51:48.923,DBG] Closing connection '"leader-secure"""' ["10.1.3.84:1112/platform-es-dn.de.prod.xx.uk", L"10.1.3.196:57794", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}] cleanly." Reason: Node st>
[64706,14,12:51:48.926,INF] ES "TcpConnectionSsl" closed [12:51:48.925: N"10.1.3.84:1112", L"10.1.3.196:57794", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}]:Received bytes: 204, Sent bytes: 350
[64706,14,12:51:48.926,INF] ES "TcpConnectionSsl" closed [12:51:48.926: N"10.1.3.84:1112", L"10.1.3.196:57794", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}]:Send calls: 2, callbacks: 2
[64706,14,12:51:48.926,INF] ES "TcpConnectionSsl" closed [12:51:48.926: N"10.1.3.84:1112", L"10.1.3.196:57794", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}]:Receive calls: 5, callbacks: 4
[64706,14,12:51:48.926,INF] ES "TcpConnectionSsl" closed [12:51:48.926: N"10.1.3.84:1112", L"10.1.3.196:57794", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}]:Close reason: [Success] "Node state changed to ShuttingDown. Closing replication c>
[64706,14,12:51:48.927,INF] Connection '"leader-secure"""' ["10.1.3.84:1112", {1e949cb9-ef70-42e4-83ca-70f56596b6c8}] closed: Success.

THIS MEANS THAT THERE IS A SURPLUS OF NODES IN THE CLUSTER, SHUTTING DOWN.

This seems to be saying that the only way to replace nodes now is manually, with downtime?

You must have the same number of machines as you have specified in the cluster size. It doesn’t impede the rolling upgrade as you can safely take one of the three nodes down, upgrade it, and bring it back up. We do it all the time in the cloud.

As for your config not working: as I said, you need to specify CertificateReservedNodeCommonName, which must match your certificate CN and must be the same for the certificates on all the nodes. By using our cert gen tool you generated certificates with our default CN, and it worked. Basically, if you had used the tool we provide, there would have been no issues in making the initial configuration work.
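
To illustrate the alignment (the wildcard value is the anonymised one from this thread; if I remember right, the cert gen tool’s default CN is eventstoredb-node):

# Every node certificate must be issued with the same CN, and every
# node's config must carry that exact string:
CertificateReservedNodeCommonName: "*.de.prod.xx.uk"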

It doesn’t impede the rolling upgrade as you can safely take one of the three nodes down
So you remove a living node (I get that quorum is 2) to replace it? We’ve always added extra clones and then removed old nodes one by one, so we were swapping a ready node in.

I did set CertificateReservedNodeCommonName when I used the purchased Sectigo wildcard certificate. And I used the tool, which validated the DNS seed etc. (well, as far as the green tick box goes).
It generated:

---
# Paths
Db: /var/lib/eventstore
Index: /var/lib/eventstore/index
Log: /var/log/eventstore

# Certificates configuration
CertificateFile: /etc/eventstore/certs/node.crt
CertificatePrivateKeyFile: /etc/eventstore/certs/node.key
TrustedRootCertificatesPath: /etc/ssl/certs
CertificateReservedNodeCommonName: "*.de.prod.xx.uk"

# Network configuration
IntIp: 10.1.3.84
ExtIp: 10.1.3.84
IntHostAdvertiseAs: platform-10-1-3-84.de.prod.xx.uk
ExtHostAdvertiseAs: platform-10-1-3-84.de.prod.xx.uk
HttpPort: 2113
IntTcpPort: 1112
ExtTcpPort: 1113
EnableExternalTcp: true
EnableAtomPubOverHTTP: true

# Cluster gossip
ClusterSize: 3
DiscoverViaDns: true
ClusterDns: platform.de.prod.xx.uk

# Projections configuration
RunProjections: All

This threw errors with the Sectigo cert (only visible with verbose gRPC logging); I generated self-signed certs for the same domain to try to exclude the issue, and ended up in the rest of the mess: the system could talk to another node’s 2113/info address but gRPC failed.

By using our cert gen tool you generated certificates with our default CN, and it worked.

Yeah, I left your default CN in there and added the wildcard DNS path as a SAN. I skipped over the IP addresses portion to try to get close to what was originally wanted.

It’s the first time I’ve heard of such an approach. Rolling upgrades are normally executed the way I described. I believe most other databases that use replica sets with a single leader do the same (Mongo, Elastic, etc.).

It’s weird that the certificate didn’t work. We use the same config for ES Cloud, and I’ve done the same using Let’s Encrypt many times; it worked fine.

I see it’s behind the UNSAFE ALLOW SURPLUS NODES flag now; I must have missed that slipping in, as it’s in the v6 change notes.
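
For anyone else landing here, the opt-in presumably looks like this in the config (use with care; it’s flagged unsafe for a reason):

# Allows nodes beyond ClusterSize to join as clones instead of being
# told to shut down (the old v5-style swap-in behaviour):
UnsafeAllowSurplusNodes: true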

Can only vaguely remember updating Elasticsearch, which was a chore. Mongo Atlas blue-greens with replicas for upgrades etc., which is quite nice.

I’ll try to sort out reproductions of the cert errors etc. At least I know why it’s not doing what it used to :partying_face:

Those two blog posts give details on how to replace nodes and upgrade in place:


Aye, thanks; I linked one of those at the top originally. I lost the second one in the post because of the restrictions on posting links :joy:
I think I’m allowed to post more than two links now, so Dobby is a free elf :partying_face: