Application error event_store.cluster.Elections/Accept

diego.martin · April 29, 2021, 12:58pm

UPDATE: The original problem was resolved by adding a DNS entry that contains all the cluster IP addresses. Please see the replies to this post for the solution and the follow-up issue.

I’m aware I lack a lot of networking knowledge and, despite Greg’s clarification on ways to discover nodes in a cluster (https://github.com/EventStore/EventStore/issues/1878), here goes my question.

I am unable to run an ESDB cluster on EC2 instances in AWS.
The nodes start but log this exception:

{"@t":"2021-04-29T11:42:18.1281960Z","@mt":"Error while retrieving cluster members through DNS.","@l":"Error","@x":"System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 0xFFFDFFFF): Name or service not known\n   at System.Net.Dns.GetHostEntryOrAddressesCore(String hostName, Boolean justAddresses)\n   at System.Net.Dns.<>c.<GetHostEntryOrAddressesCoreAsync>b__27_2(Object s)\n   at System.Threading.Tasks.Task`1.InnerInvoke()\n   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)\n--- End of stack trace from previous location ---\n   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)\n--- End of stack trace from previous location ---\n   at System.Threading.Tasks.TaskToApm.End[TResult](IAsyncResult asyncResult)\n   at EventStore.Core.Services.Gossip.DnsGossipSeedSource.EndGetHostEndpoints(IAsyncResult asyncResult) in /home/runner/work/TrainStation/TrainStation/build/oss-eventstore/src/EventStore.Core/Services/Gossip/DnsGossipSeedSource.cs:line 20\n   at EventStore.Core.Services.Gossip.GossipServiceBase.OnGotGossipSeedSources(IAsyncResult ar) in /home/runner/work/TrainStation/TrainStation/build/oss-eventstore/src/EventStore.Core/Services/Gossip/GossipServiceBase.cs:line 109","SourceContext":"EventStore.Core.Services.Gossip.GossipServiceBase","ProcessId":4260,"ThreadId":4}

It’s probably due to my DNS setup.

Node A: 10.0.10.188
Node B: 10.0.20.25
Node C: 10.0.30.95

I’ve created a private hosted zone called saswesdb.io in AWS Route53, associated with my VPC, with these entries:

esdb-a.saswesdb.io	A	Simple	-	10.0.10.188
esdb-b.saswesdb.io	A	Simple	-	10.0.20.25
esdb-c.saswesdb.io	A	Simple	-	10.0.30.95

saswesdb.io	NS	Simple	-	
ns-1536.awsdns-00.co.uk.
ns-0.awsdns-00.com.
ns-1024.awsdns-00.org.
ns-512.awsdns-00.net.

saswesdb.io	SOA	Simple	-	
ns-1536.awsdns-00.co.uk. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400

My configuration is

# Cluster
ClusterSize: 3
ClusterDns: saswesdb.io
DiscoverViaDns: true

# Paths
Db: "/home/ubuntu/my-data"
Log: "/home/ubuntu/my-logs"
Index: "/home/ubuntu/my-index"

# Security
Insecure: true

# Network
IntIp: 10.0.10.188 # value for the first node; each other node uses its own IP
ExtIp: 10.0.10.188 # value for the first node; each other node uses its own IP
HttpPort: 2113
IntTcpPort: 1113
ExtTcpPort: 1113
EnableExternalTcp: false
EnableAtomPubOverHttp: true

# Projections
RunProjections: None

Each EC2 instance can ping the others, both by IP address and by DNS name:

ping esdb-b.saswesdb.io
PING esdb-b.saswesdb.io (10.0.20.25) 56(84) bytes of data.
64 bytes from ip-10-0-20-25.eu-west-1.compute.internal (10.0.20.25): icmp_seq=1 ttl=64 time=0.628 ms
64 bytes from ip-10-0-20-25.eu-west-1.compute.internal (10.0.20.25): icmp_seq=2 ttl=64 time=0.636 ms
64 bytes from ip-10-0-20-25.eu-west-1.compute.internal (10.0.20.25): icmp_seq=3 ttl=64 time=0.654 ms
64 bytes from ip-10-0-20-25.eu-west-1.compute.internal (10.0.20.25): icmp_seq=4 ttl=64 time=0.662 ms

I cannot ping saswesdb.io itself, though; I’m not sure whether I should be able to.

Am I missing anything related to the DNS?

I must admit I don’t know how to troubleshoot this.
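One way to troubleshoot this is to ask the resolver directly which addresses a name returns, since ping only tells you about one address. The following is a minimal Python sketch using the standard library's `socket.getaddrinfo`; it is not part of EventStoreDB. The function name `resolve_all` and the injectable `resolver` parameter are made up for illustration.

```python
import socket

def resolve_all(name, port=2113, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses that `name` resolves to.

    `resolver` is a hypothetical hook so the function can be exercised
    without a real DNS server; by default it is socket.getaddrinfo.
    """
    try:
        infos = resolver(name, port, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # Resolution failed, e.g. "Name or service not known".
        return []
    # Each entry is (family, type, proto, canonname, (ip, port)).
    return sorted({info[4][0] for info in infos})
```

Run on a node, `resolve_all("esdb-a.saswesdb.io")` should return a single address. With the zone listed above, `resolve_all("saswesdb.io")` would be expected to return an empty list, because the zone apex has only NS and SOA records and no A record. That would match the "Name or service not known" error in the log.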

PS: The EC2 instances have a NACL that allows all traffic and a security group that allows all TCP traffic from anywhere (i.e. 0.0.0.0/0).

diego.martin · April 29, 2021, 1:18pm

OK, Greg’s tip in “Issue clustering” on using a DNS entry with multiple IP addresses helped.
So I created this entry:

esdb-discovery.saswesdb.io	A	Simple	-	
10.0.10.188
10.0.20.25
10.0.30.95

And now I can set

ClusterDns: esdb-discovery.saswesdb.io

in each node’s configuration. The original problem is gone, but now I have another one.

Node A is up and running, with no errors.

Node B shows

{"@t":"2021-04-29T13:12:50.5905435Z","@mt":"Accept Send Failed to {Server}","@x":"System.AggregateException: One or more errors occurred. (Status(StatusCode=\"DeadlineExceeded\", Detail=\"\"))\n ---> Grpc.Core.RpcException: Status(StatusCode=\"DeadlineExceeded\", Detail=\"\")\n   at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)\n   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)\n   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)\n   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)\n   at EventStore.Cluster.Elections.ElectionsClient.Accept(AcceptRequest request, CallOptions options) in /home/runner/work/TrainStation/TrainStation/build/oss-eventstore/src/EventStore.Core/obj/x64/Release/net5.0/ClusterGrpc.cs:line 428\n   at EventStore.Cluster.Elections.ElectionsClient.Accept(AcceptRequest request, Metadata headers, Nullable`1 deadline, CancellationToken cancellationToken) in /home/runner/work/TrainStation/TrainStation/build/oss-eventstore/src/EventStore.Core/obj/x64/Release/net5.0/ClusterGrpc.cs:line 424\n   at EventStore.Core.Cluster.EventStoreClusterClient.SendAcceptAsync(Guid serverId, EndPoint serverHttpEndPoint, Guid leaderId, EndPoint leaderHttp, Int32 view, DateTime deadline) in /home/runner/work/TrainStation/TrainStation/build/oss-eventstore/src/EventStore.Core/Cluster/EventStoreClusterClient.Elections.cs:line 178\n   --- End of inner exception stack trace 
---","Server":"Unspecified/10.0.10.188:2113","SourceContext":"EventStore.Core.Cluster.EventStoreClusterClient","ProcessId":3893,"ThreadId":8}
{"@t":"2021-04-29T13:12:50.6229402Z","@mt":"Accept Send Failed to {Server}","@x":"System.AggregateException: One or more errors occurred. (Status(StatusCode=\"DeadlineExceeded\", Detail=\"\"))\n ---> Grpc.Core.RpcException: Status(StatusCode=\"DeadlineExceeded\", Detail=\"\")\n   at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)\n   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)\n   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)\n   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)\n   at EventStore.Cluster.Elections.ElectionsClient.Accept(AcceptRequest request, CallOptions options) in /home/runner/work/TrainStation/TrainStation/build/oss-eventstore/src/EventStore.Core/obj/x64/Release/net5.0/ClusterGrpc.cs:line 428\n   at EventStore.Cluster.Elections.ElectionsClient.Accept(AcceptRequest request, Metadata headers, Nullable`1 deadline, CancellationToken cancellationToken) in /home/runner/work/TrainStation/TrainStation/build/oss-eventstore/src/EventStore.Core/obj/x64/Release/net5.0/ClusterGrpc.cs:line 424\n   at EventStore.Core.Cluster.EventStoreClusterClient.SendAcceptAsync(Guid serverId, EndPoint serverHttpEndPoint, Guid leaderId, EndPoint leaderHttp, Int32 view, DateTime deadline) in /home/runner/work/TrainStation/TrainStation/build/oss-eventstore/src/EventStore.Core/Cluster/EventStoreClusterClient.Elections.cs:line 178\n   --- End of inner exception stack trace 
---","Server":"Unspecified/10.0.30.95:2113","SourceContext":"EventStore.Core.Cluster.EventStoreClusterClient","ProcessId":3893,"ThreadId":4}

and Node C shows the following error

{"@t":"2021-04-29T13:12:50.6394740Z","@mt":"Connection id \"{ConnectionId}\", Request id \"{TraceIdentifier}\": An unhandled exception was thrown by the application.","@l":"Error","@x":"System.Threading.Tasks.TaskCanceledException: A task was canceled.\n   at EventStore.Core.Services.Transport.Http.AuthenticationMiddleware.InvokeAsync(HttpContext context, RequestDelegate next) in /home/runner/work/TrainStation/TrainStation/build/oss-eventstore/src/EventStore.Core/Services/Transport/Http/AuthenticationMiddleware.cs:line 24\n   at Microsoft.AspNetCore.Builder.UseMiddlewareExtensions.<>c__DisplayClass6_1.<<UseMiddlewareInterface>b__1>d.MoveNext()\n--- End of stack trace from previous location ---\n   at Microsoft.AspNetCore.Builder.Extensions.MapMiddleware.Invoke(HttpContext context)\n   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)","ConnectionId":"0HM8B21TPRS4M","TraceIdentifier":"0HM8B21TPRS4M:0000000B","EventId":{"Id":13,"Name":"ApplicationError"},"SourceContext":"Microsoft.AspNetCore.Server.Kestrel","RequestId":"0HM8B21TPRS4M:0000000B","RequestPath":"/event_store.cluster.Elections/Accept","ProcessId":4938,"ThreadId":8}

The problem has changed, but not the outcome: I’m lost.
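A successful ping only shows that ICMP gets through, while the failing Accept calls in the logs target port 2113 on the other nodes. A quick way to rule out basic connectivity is to try opening a TCP connection to the ports from the configuration (2113 and 1113). This is a minimal Python sketch with hypothetical helper names `can_connect` and `check_nodes`. It does not explain a DeadlineExceeded error by itself; it only shows whether the ports are reachable.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False

def check_nodes(hosts, ports=(2113, 1113)):
    """Map each (host, port) pair to whether a TCP connection could be opened."""
    return {(h, p): can_connect(h, p) for h in hosts for p in ports}
```

For example, `check_nodes(["10.0.10.188", "10.0.20.25", "10.0.30.95"])` run from each node gives a small reachability matrix for the cluster.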

diego.martin · April 29, 2021, 1:49pm

After restarting nodes B and C I don’t see the errors anymore.

So… after this long detour (sorry), my question is: is there any way to verify that the cluster is working properly?

chris.condron · April 29, 2021, 2:16pm

You can curl the gossip endpoint.
https://[address]:2113/gossip
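If you want to check the gossip response automatically rather than read it by eye, something like the Python sketch below could work. The field names (`members`, `state`, `isAlive`, `httpEndPointIp`) are the ones the gossip endpoint returns; the function names and the health criteria (expected member count, all members alive, exactly one Leader) are assumptions for a plain cluster without read-only replicas, not an official health check.

```python
import json
from urllib.request import urlopen

def fetch_gossip(base_url, timeout=5.0):
    """GET <base_url>/gossip and return the decoded JSON document."""
    with urlopen(f"{base_url}/gossip", timeout=timeout) as resp:
        return json.load(resp)

def summarize_gossip(doc, expected_size):
    """Return a list of problems found in a gossip document (empty means none found)."""
    members = doc.get("members", [])
    problems = []
    if len(members) != expected_size:
        problems.append(f"expected {expected_size} members, saw {len(members)}")
    dead = [m.get("httpEndPointIp", "?") for m in members if not m.get("isAlive")]
    if dead:
        problems.append("not alive: " + ", ".join(dead))
    leaders = [m for m in members if m.get("state") == "Leader"]
    if len(leaders) != 1:
        problems.append(f"expected exactly 1 Leader, saw {len(leaders)}")
    return problems
```

Usage would be along the lines of `summarize_gossip(fetch_gossip("http://esdb-a.saswesdb.io:2113"), 3)`. Note the scheme: a node running with `Insecure: true` serves plain http, not https.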

diego.martin · April 29, 2021, 2:23pm

Ah, that’s what I was looking for, thanks for the info.

curl http://esdb-a.saswesdb.io:2113/gossip
{
  "members": [
    {
      "instanceId": "d4b33571-7f38-424f-b950-951b3271b758",
      "timeStamp": "2021-04-29T14:19:09.0275189Z",
      "state": "Leader",
      "isAlive": true,
      "internalTcpIp": "10.0.30.95",
      "internalTcpPort": 1113,
      "internalSecureTcpPort": 0,
      "externalTcpIp": "10.0.30.95",
      "externalTcpPort": 0,
      "externalSecureTcpPort": 0,
      "httpEndPointIp": "10.0.30.95",
      "httpEndPointPort": 2113,
      "lastCommitPosition": 601,
      "writerCheckpoint": 1034,
      "chaserCheckpoint": 1034,
      "epochPosition": 786,
      "epochNumber": 2,
      "epochId": "d9576b7b-5d42-4039-8cd2-69a28ec581ce",
      "nodePriority": 0,
      "isReadOnlyReplica": false
    },
    {
      "instanceId": "b4037d47-c8ce-4aea-8150-ed2e30142b4e",
      "timeStamp": "2021-04-29T14:19:09.0257961Z",
      "state": "Follower",
      "isAlive": true,
      "internalTcpIp": "10.0.20.25",
      "internalTcpPort": 1113,
      "internalSecureTcpPort": 0,
      "externalTcpIp": "10.0.20.25",
      "externalTcpPort": 0,
      "externalSecureTcpPort": 0,
      "httpEndPointIp": "10.0.20.25",
      "httpEndPointPort": 2113,
      "lastCommitPosition": 601,
      "writerCheckpoint": 1034,
      "chaserCheckpoint": 1034,
      "epochPosition": 786,
      "epochNumber": 2,
      "epochId": "d9576b7b-5d42-4039-8cd2-69a28ec581ce",
      "nodePriority": 0,
      "isReadOnlyReplica": false
    },
    {
      "instanceId": "e75dff84-d043-40fc-b6e9-572563625c4a",
      "timeStamp": "2021-04-29T14:19:09.0275317Z",
      "state": "Follower",
      "isAlive": true,
      "internalTcpIp": "10.0.10.188",
      "internalTcpPort": 1113,
      "internalSecureTcpPort": 0,
      "externalTcpIp": "10.0.10.188",
      "externalTcpPort": 0,
      "externalSecureTcpPort": 0,
      "httpEndPointIp": "10.0.10.188",
      "httpEndPointPort": 2113,
      "lastCommitPosition": 601,
      "writerCheckpoint": 1034,
      "chaserCheckpoint": 1034,
      "epochPosition": 786,
      "epochNumber": 2,
      "epochId": "d9576b7b-5d42-4039-8cd2-69a28ec581ce",
      "nodePriority": 0,
      "isReadOnlyReplica": false
    }
  ],
  "serverIp": "10.0.10.188",
  "serverPort": 2113
}

PS: Now I have to figure out how to actually connect to the cluster (and its leader) with the gRPC client. But that’s another story: I haven’t done my research on it yet and, hopefully, it’s straightforward.

chris.condron · April 29, 2021, 5:27pm

https://developers.eventstore.com/clients/grpc/getting-started/
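For a cluster like the one in this thread, the gRPC clients take a connection string. As a rough sketch, assuming the `esdb+discover://` (DNS discovery) and `esdb://` (explicit gossip seeds) schemes and the `tls` option described in the client docs linked above, the strings could be built as follows. The helper names are made up for illustration, and the exact options should be checked against the docs for your client version.

```python
def discovery_connection_string(cluster_dns, http_port=2113, tls=False):
    """Connection string that lets the client discover nodes through one DNS name."""
    flag = "true" if tls else "false"
    return f"esdb+discover://{cluster_dns}:{http_port}?tls={flag}"

def seeds_connection_string(hosts, http_port=2113, tls=False):
    """Connection string that lists each node explicitly as a gossip seed."""
    flag = "true" if tls else "false"
    seeds = ",".join(f"{h}:{http_port}" for h in hosts)
    return f"esdb://{seeds}?tls={flag}"
```

With the insecure setup above, `discovery_connection_string("esdb-discovery.saswesdb.io")` gives `esdb+discover://esdb-discovery.saswesdb.io:2113?tls=false`.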