20.6.1 from 5.0.9 .NET client

Hi there,

I have just got a 3-node 20.6.1 cluster working. The certificates do not have IP address SANs; instead I have configured the “Advertise” options to use the DNS A records (cluster0,1,2 [group record ‘clusterdns’] and clustergossip0,1,2; we have a two-interface setup, with internal gossip going via a separate interface). I want to document a migration path for our developers (typically all running 5.x client libraries), but I am struggling to find a way to do this. Ideally the migration is just a connection string change, as updating libraries and releasing, on top of updating connection strings, makes things much harder for all involved.

  1. Is it true that not having IP SANs means DNS gossip discovery is effectively broken in 20.x (e.g. “ConnectTo:discover://clusterdns:httpport”)? Our current thinking is that certificates with IP SANs are generally discouraged, so we have avoided them here. Obviously the IPs returned by the DNS discovery A records do not match the certificates, so clients cannot read the gossip. (This isn’t just a client issue; I also had to disable DNS discovery in my cluster node config.)
  2. Without working DNS gossip discovery, we can fall back to having a load balancer (nginx in our case) over our clusters, and use that as a fixed gossip seed (e.g. “GossipSeeds=https://cluster-loadbalancer:443;”). Ideally we would avoid adding an extra component to the critical path for ES connections/reconnections but this does a similar job to DNS Discovery and is “ok”.
  3. Whilst I can get #2 working with the latest 20.6.1 .NET client, it doesn’t work with 5.0.9. I’m not sure why; perhaps the gossip JSON is different? So is the only way to connect a 5.0.9 client to a 20.6.1 server to use a direct tcp:// address (e.g. “ConnectTo:tcp://clusterdns:exttcpport”)? In that case it will pick a random node regardless of leader or follower status, which isn’t workable.
  4. If #3 is the case, does that mean we need to update all our clients to the 20.6.1 client package, roll them out, and only then update the clusters to 20.6.1? Does the 20.6.1 .NET client work fine with older v5 clusters? (I am about to test this myself.)

My gossip JSON looks something like this:

{
  "members": [
    {
      "instanceId": "f1281999-fc07-43ea-839a-d380bf80cc24",
      "timeStamp": "2020-10-05T21:02:11.3277271Z",
      "state": "Leader",
      "isAlive": true,
      "internalTcpIp": "test-uat-eventstoregossip2",
      "internalTcpPort": 0,
      "internalSecureTcpPort": 11112,
      "externalTcpIp": "test-uat-eventstore2",
      "externalTcpPort": 0,
      "externalSecureTcpPort": 11113,
      "httpEndPointIp": "test-uat-eventstore2",
      "httpEndPointPort": 12113,
      "lastCommitPosition": 12833,
      "writerCheckpoint": 12985,
      "chaserCheckpoint": 12985,
      "epochPosition": 12546,
      "epochNumber": 4,
      "epochId": "c35c4784-672b-4f75-a956-b40b706bae60",
      "nodePriority": 1,
      "isReadOnlyReplica": false
    },
    {
      "instanceId": "b3095846-6f21-49ed-9974-a0fe08cb7100",
      "timeStamp": "2020-10-05T21:02:11.2709331Z",
      "state": "Follower",
      "isAlive": true,
      "internalTcpIp": "test-uat-eventstoregossip1",
      "internalTcpPort": 0,
      "internalSecureTcpPort": 11112,
      "externalTcpIp": "test-uat-eventstore1",
      "externalTcpPort": 0,
      "externalSecureTcpPort": 11113,
      "httpEndPointIp": "test-uat-eventstore1",
      "httpEndPointPort": 12113,
      "lastCommitPosition": 12833,
      "writerCheckpoint": 12985,
      "chaserCheckpoint": 12985,
      "epochPosition": 12546,
      "epochNumber": 4,
      "epochId": "c35c4784-672b-4f75-a956-b40b706bae60",
      "nodePriority": 1,
      "isReadOnlyReplica": false
    },
    {
      "instanceId": "d7b5ae1f-a59b-4b5f-8af4-472939bbf8b3",
      "timeStamp": "2020-10-05T21:02:11.332983Z",
      "state": "Follower",
      "isAlive": true,
      "internalTcpIp": "test-uat-eventstoregossip0",
      "internalTcpPort": 0,
      "internalSecureTcpPort": 11112,
      "externalTcpIp": "test-uat-eventstore0",
      "externalTcpPort": 0,
      "externalSecureTcpPort": 11113,
      "httpEndPointIp": "test-uat-eventstore0",
      "httpEndPointPort": 12113,
      "lastCommitPosition": 12833,
      "writerCheckpoint": 12985,
      "chaserCheckpoint": 12985,
      "epochPosition": 12546,
      "epochNumber": 4,
      "epochId": "c35c4784-672b-4f75-a956-b40b706bae60",
      "nodePriority": 0,
      "isReadOnlyReplica": false
    }
  ],
  "serverIp": "test-uat-eventstore0",
  "serverPort": 12113
}
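
To make the leader/follower concern concrete, here is a rough Python sketch of how the leader's external endpoint can be picked out of gossip like the above. It is not the .NET client's actual logic; the function name `pick_leader_endpoint` and the trimmed sample payload are made up for illustration, and only field names that appear in the JSON above are used.

```python
import json

# Trimmed copy of the gossip payload above (two members, fewer fields).
SAMPLE_GOSSIP = """
{
  "members": [
    {"state": "Leader", "isAlive": true,
     "externalTcpIp": "test-uat-eventstore2", "externalSecureTcpPort": 11113},
    {"state": "Follower", "isAlive": true,
     "externalTcpIp": "test-uat-eventstore1", "externalSecureTcpPort": 11113}
  ]
}
"""


def pick_leader_endpoint(gossip_json):
    """Return (host, port) of the first alive Leader, or None if there is none."""
    gossip = json.loads(gossip_json)
    for member in gossip.get("members", []):
        if member.get("isAlive") and member.get("state") == "Leader":
            return member["externalTcpIp"], member["externalSecureTcpPort"]
    return None


leader = pick_leader_endpoint(SAMPLE_GOSSIP)
```

A direct tcp:// connection to the group DNS record skips this selection entirely, which is why it can land on a follower.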

The issue is that the DNS discovery in 20.6.1 works as follows (exactly like in v5):

  • Resolve the DNS name to IP addresses
  • Call the gossip endpoint of one of the resolved nodes by using the node IP
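
The two steps above can be sketched in Python roughly as follows. This is only an illustration of the flow, not the client's real code: the helper names `resolve_ips` and `gossip_urls` are hypothetical, and the `/gossip` path is assumed here.

```python
import socket


def resolve_ips(dns_name):
    """Step 1: resolve the cluster DNS name to its distinct IP addresses."""
    infos = socket.getaddrinfo(dns_name, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def gossip_urls(dns_name, http_port, resolver=resolve_ips):
    """Step 2: build the gossip URL for each node, addressed by IP, not by name."""
    urls = []
    for ip in resolver(dns_name):
        host = f"[{ip}]" if ":" in ip else ip  # bracket IPv6 literals
        urls.append(f"https://{host}:{http_port}/gossip")  # path is an assumption
    return urls
```

Note that the host part of every resulting URL is an IP address, so the DNS name the certificate was issued for never appears in the request.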

Because the client then calls the node by its IP address, certificate validation fails when the certificate doesn’t have an IP SAN. So, at this moment, we advise using gossip seeds (an explicit list of node addresses) for the cluster gossip when using trusted certificates, as you’d have to use a wildcard certificate.
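
A simplified sketch of why the check fails when the node is addressed by IP is shown below. It is a deliberately reduced version of the usual hostname-matching rules, not what the server or client actually runs; `cert_matches` and the example addresses are hypothetical.

```python
import ipaddress


def is_ip(value):
    """True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def cert_matches(target, dns_sans, ip_sans=()):
    """Simplified check of whether a certificate covers the address being called."""
    if is_ip(target):
        # An IP target is only compared against IP SANs, never DNS SANs.
        wanted = ipaddress.ip_address(target)
        return any(wanted == ipaddress.ip_address(san) for san in ip_sans)
    target = target.lower()
    for san in dns_sans:
        san = san.lower()
        if san == target:
            return True
        # A wildcard stands in for exactly one left-most label.
        if san.startswith("*.") and "." in target and target.split(".", 1)[1] == san[2:]:
            return True
    return False
```

With a DNS-only certificate, `cert_matches("10.0.0.1", ["*.example.com"])` is false while `cert_matches("node0.example.com", ["*.example.com"])` is true, which is the difference between DNS discovery and gossip seeds given as host names.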

We are well aware of the issue, and the way we use DNS to discover the cluster nodes will most likely change, possibly as early as the upcoming 20.10 release.