AWS Lambda issues connecting to EventStoreDB

My team and I have been banging our heads against this issue for a while, and I’ve reached the point where it’s time to reach out!

We have a self-hosted 3-node EventStoreDB cluster in EC2. Here’s the config from one of the nodes:

# Paths
Db: /mnt/eventstore/eventstore
Index: /mnt/eventstore/eventstore/index
Log: /var/log/eventstore

Insecure: true

# Network configuration
IntIp: 10.0.128.33
ExtIp: 10.0.128.33
HttpPort: 2113
IntTcpPort: 1112
ExtTcpPort: 1113
EnableExternalTcp: false
EnableAtomPubOverHTTP: true

# Cluster gossip
ClusterSize: 3
DiscoverViaDns: false
GossipSeed: 10.0.128.31:2113,10.0.128.32:2113

# Projections configuration
RunProjections: None
# Diagnostics
LogLevel: Debug
LogConsoleFormat: Plain
LogHttpRequests: true

# Timeouts and intervals
GossipIntervalMs: 2000
GossipTimeoutMs: 3000
IntTcpHeartbeatInterval: 5000
IntTcpHeartbeatTimeout: 1000

We’re connecting to this with the following connection string from an AWS Lambda:

esdb+discover://10.0.128.31:2113,10.0.128.32:2113,10.0.128.33:2113?tls=false
# Note that we were using DNS and will revert to it, but we wanted to make sure we got it working very explicitly first.

This connection works for just a bit, then we start seeing endless “deadline exceeded” errors. I’ve been trying different settings, like this connection string, which seems to work at first, then fails and keeps failing:

esdb+discover://10.0.128.31:2113,10.0.128.32:2113,10.0.128.33:2113?tls=false&maxDiscoverAttempts=2&defaultDeadline=1000&discoveryInterval=100&gossipTimeout=1&throwOnAppendFailure=true&keepAliveInterval=10000&keepAliveTimeout=10000

Here’s the error we get on Lambda:

 Error: Failed to discover after 2 attempts.
    at discoverEndpoint (/var/task/index.js:415253:17)
    at async Client.resolveUri (/var/task/index.js:415685:43)
    at async Client.createChannel (/var/task/index.js:415662:27)
    at async Client.createGRPCClient (/var/task/index.js:415619:39)
    at async /var/task/index.js:415570:30
    at async ReadStream.initialize (/var/task/index.js:419816:18)

And here’s what we see in the server logs:

{
    "@t":"2023-09-10T21:14:09.3187649+00:00",
    "@mt":"View Change Proof Send Failed to {Server}",
    "@l":"Information",
    "@i":802807059,
    "@x":"System.AggregateException: One or more errors occurred. (Status(StatusCode=\"DeadlineExceeded\", Detail=\"\"))
          ---> Grpc.Core.RpcException: Status(StatusCode=\"DeadlineExceeded\", Detail=\"\")
               at EventStore.Core.Cluster.EventStoreClusterClient.SendViewChangeProofAsync(Guid serverId, EndPoint serverHttpEndPoint, Int32 installedView, DateTime deadline) in /home/runner/work/TrainStation/TrainStation/build/oss-eventstore/src/EventStore.Core/Cluster/EventStoreClusterClient.Elections.cs:line 115
           --- End of inner exception stack trace ---",
    "Server":"Unspecified/10.0.128.33:2113",
    "SourceContext":"EventStore.Core.Cluster.EventStoreClusterClient",
    "ProcessId":133000,
    "ThreadId":6
}

I have a feeling it’s related to gRPC-node and Lambda not playing nicely together, but I’m not sure how to debug further.

We have the same setup running locally with a 3-node cluster in Docker with a nearly identical config, and it works flawlessly while being bombarded with hundreds of integration tests.

Any help anyone can provide in diagnosing this issue would be much appreciated :pray:

Another thing to add: on the Lambda side, we’ve tried a couple of options for creating the client itself. We have a function that looks like this:

  protected client(): EventStoreDBClient {
    return EventStoreDBClient.connectionString`${this.host}`;
  }

This returns a new client for every request that’s made, and we have a finally block that disposes of the client like this:

  try {
    client = this.client();
    eventStream = client.readStream(/*...*/);
    // ...
  } finally {
    if (eventStream) {
      eventStream.destroy();
    }
    // Guard against this.client() having thrown before client was assigned.
    await client?.dispose();
  }

We’ve tried this approach, and we’ve also tried a singleton approach where we reuse the same client between Lambda invocations (see the sketch below). We continue to see the deadline exceeded error in both cases.
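
For context, the singleton variant looked roughly like this; the stream name and handler shape are illustrative placeholders, not our actual code:

  import { EventStoreDBClient } from "@eventstore/db-client";

  const connectionString =
    "esdb+discover://10.0.128.31:2113,10.0.128.32:2113,10.0.128.33:2113?tls=false";

  // Created once at module scope so warm Lambda invocations reuse the same client.
  const client = EventStoreDBClient.connectionString`${connectionString}`;

  export const handler = async (): Promise<void> => {
    // Placeholder read; our real handlers read application streams.
    const events = client.readStream("some-stream", { maxCount: 10 });
    for await (const { event } of events) {
      console.log(event?.type);
    }
  };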

We know 100% it works sometimes, so it’s unlikely that it’s a connectivity issue.

Well, it turns out the ExtIp setting was the cause of all of our issues. ExtIp being the same as IntIp made things go haywire; setting ExtIp to 0.0.0.0 fixed it.
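
Concretely, the network section of the config above now looks like this (only ExtIp changed; IntIp stays on the node’s private address):

# Network configuration (after the fix)
IntIp: 10.0.128.33
ExtIp: 0.0.0.0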

It was also difficult to spot because, as we changed settings, the servers wouldn’t immediately come back up (they were running a database verification on startup), so for dev mode we set this flag in the server configs:

SkipDbVerify: true

which allowed us to see the effect of changing settings much quicker.

Hope this helps someone out there.

Interesting. The two sets of IP addresses (Int & Ext) are used as follows:
Int & Ext → node-to-node communication
Ext → clients

So the first error you saw is the clients not being able to connect to any of the nodes.

What you can do:

  • Set both IntIp & ExtIp to 0.0.0.0: this binds the server to all IP addresses (see the combined sketch after this list).
  • Gossip seed: you can definitely set all 3 addresses in there; each node will exclude itself from the list. This makes configuration management easier:
GossipSeed: 10.0.128.31:2113,10.0.128.32:2113,10.0.128.33:2113
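
Putting both suggestions together, the network section of each node’s config would look roughly like this (ports taken from your config above; a sketch, not a tested config):

IntIp: 0.0.0.0
ExtIp: 0.0.0.0
HttpPort: 2113
IntTcpPort: 1112
GossipSeed: 10.0.128.31:2113,10.0.128.32:2113,10.0.128.33:2113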

I would advise, though, using DNS names for the nodes, and eventually a cluster DNS name for the clients & nodes;
it’s going to make your life easier from an operational point of view.
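
For example, with a single DNS name that resolves to all three nodes (the hostname here is just an illustration), the server-side discovery settings and the client connection string would look roughly like:

DiscoverViaDns: true
ClusterDns: esdb.internal.example.com

esdb+discover://esdb.internal.example.com:2113?tls=false

(and tls=true once certificates are in place, per the links below).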

Also you should be using certificates :wink:
DNS discovery and Certificate Management for EventStoreDB versions 21.10.5 and beyond
Gossip Seed configuration and Certificate Management for EventStoreDB