Health check regarding specific node's state


We’ve noticed that when a node is in any state other than caught up (e.g. Initializing), the /ping and /gossip health checks still return 200 OK.

This has caused us to route requests to nodes that aren’t fully operational yet, which is causing issues.

Is there any way to query a specific node for its state through an HTTP request?

Right now we’ve found a workaround by querying one of our projections, but I’d like to get away from that, as projections are subject to change.



/gossip returns the status of the nodes as JSON in the response body.

Yup, I saw that. It also supplies the status of all the other nodes. I don’t have the ability to parse JSON in the load balancer that I’m using, though.

I’m looking more for an HTTP request that responds with 200 OK when a particular node is caught up and ready.
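For anyone who can run a small script next to the load balancer, the check described above could be sketched roughly like this. It takes the raw /gossip body and answers whether one particular node is alive and in a usable state. The field names (members, state, isAlive, externalHttpIp/externalHttpPort or httpEndPointIp/httpEndPointPort) and the list of "ready" states are assumptions based on typical gossip output and may differ between versions, so check them against what your nodes actually return.

```python
import json

# Assumption: states in which a node can serve requests. Older releases use
# Master/Slave, newer ones Leader/Follower; adjust to your version.
READY_STATES = {"Master", "Slave", "Leader", "Follower", "ReadOnlyReplica"}


def node_is_ready(gossip_body, ip, port):
    """Return True if the member at ip:port is alive and in a ready state."""
    gossip = json.loads(gossip_body)
    for member in gossip.get("members", []):
        # Assumption: the HTTP endpoint is exposed under one of these two
        # pairs of field names, depending on the server version.
        member_ip = member.get("httpEndPointIp", member.get("externalHttpIp"))
        member_port = member.get("httpEndPointPort", member.get("externalHttpPort"))
        if member_ip == ip and member_port == port:
            return bool(member.get("isAlive")) and member.get("state") in READY_STATES
    # The node is not listed at all, so treat it as not ready.
    return False
```

A node that is missing from the member list is reported as not ready, which seems the safer default for a load balancer.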

This is also a problem for ELB and ALB health checks in AWS. We should probably do this, with something like /health/node and /health/cluster. To be clear, though, gossip and ping are not intended as health-checking mechanisms. I’ll open an issue for this on GitHub.


Any news on this?

We’re having the same problem when rebooting a node: as soon as the service is started, /gossip says it’s OK even if the node is still catching up or hashing, which can take some time.

Parsing the JSON is feasible with HAProxy (external-check command + curl + jq), but we used it in a Rancher environment, which does not allow this natively.

And launching two processes for a simple HTTP check is quite overkill.

/health/node would be fine if it says that the node is up and ready to process requests.
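Until something like that exists in the server, a single-process stand-in could look roughly like the sketch below: a tiny HTTP server that answers 200 on /health/node when the node state is acceptable and 503 otherwise. Everything here is hypothetical: the /health/node path is only the name proposed in this thread, get_state is a placeholder callable you would have to implement yourself (for example by reading the node's entry from /gossip), and the set of ready states is an assumption.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: states considered ready; names differ between versions.
READY_STATES = {"Master", "Slave", "Leader", "Follower", "ReadOnlyReplica"}


def make_handler(get_state):
    """Build a handler class around get_state, a callable returning the node state."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health/node":
                self.send_response(404)
            else:
                try:
                    ready = get_state() in READY_STATES
                except Exception:
                    # If the state cannot be determined, report unhealthy.
                    ready = False
                self.send_response(200 if ready else 503)
            self.end_headers()

        def log_message(self, format, *args):
            pass  # keep probe traffic out of stderr

    return HealthHandler


def serve(get_state, port=8080):
    """Serve the health endpoint forever on the given (hypothetical) port."""
    HTTPServer(("", port), make_handler(get_state)).serve_forever()
```

The load balancer then only needs a plain HTTP status check against this port, with no JSON parsing on its side.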



catching-up & hashing

Catching up has a different node state in gossip so I am a bit confused...

For hashing, two things. The first is that this can be disabled via
options (it should never fail in the first place! Depending on
hardware setup this may actually be a perfectly safe option; we err
on the side of safety in the default config).

The second is that the node can become fully alive with tfchunk
hashing still running, so it's not possible as of now. A gossip flag
for whether or not hashing is currently occurring could easily be
added, however.


Is there any solution now for this problem?

Answering my own question.
Looking at the Docker image declaration, it seems that there’s a /health/live endpoint available.
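A minimal probe against that endpoint might look like the sketch below. It only checks that GET /health/live answers with a 2xx status; whether that endpoint also means "caught up and ready" is not something I can confirm, so treat it as a liveness check only. The base URL and timeout are placeholders.

```python
import urllib.error
import urllib.request


def is_live(base_url, timeout=2.0):
    """Return True if GET <base_url>/health/live answers with a 2xx status."""
    url = base_url.rstrip("/") + "/health/live"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not live.
        return False
```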

A node that’s initializing, or in any other state where it’s functioning, is healthy.
The easiest way to check a node’s health is to look at the cluster info.

You can find inspiration for this in the Prometheus exporter.

Note that soon we’ll have a new /metrics endpoint that will make it even easier to check node status.

Also, if the goal is just to have, for instance, reads / catch-up on follower nodes, you can do that through the connection string by specifying the nodePreference.
example esdb+discover://"
The possible values are Random, Leader, Follower, and ReadOnlyReplica.
The default is Leader.