We use collectd + riemann + influxdb + grafana for monitoring. After a recent spate of eventstore flakiness, I’d like to add some monitoring to our eventstore cluster. I’m assuming the simplest route is to curl the gossip and stats endpoints, but does anyone have any idea what the vital metrics are?
What things should we flag as critical, and what things should we raise for further investigation?
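For reference, a minimal sketch of that curl-the-endpoints route, assuming the default external HTTP port (2113) and placeholder node addresses; the field names in the responses vary a little between versions, so inspect the JSON first:

```python
import json
import urllib.request

# Placeholder addresses for a 3 node cluster; adjust to your own nodes/port.
NODES = ["http://10.0.0.1:2113", "http://10.0.0.2:2113", "http://10.0.0.3:2113"]

def fetch(url):
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))

for node in NODES:
    gossip = fetch(node + "/gossip")   # cluster membership as this node sees it
    stats = fetch(node + "/stats")     # large nested stats document
    print(node, "sees", len(gossip.get("members", [])), "members")
    print(node, "top-level stats keys:", sorted(stats.keys()))
```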
From stats? Depends what you are trying to monitor (lots of things can be there!)

From gossip, that you have a cluster is a good start. One and only one master, two slaves (for a 3 node cluster). Also I would watch the writer checkpoints to see if a node is falling behind for any reason.
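For illustration, a sketch of those two checks against the gossip endpoint, assuming the JSON carries a members list with state, isAlive and writerCheckpoint fields (names may differ between versions); the node address and lag threshold are placeholders:

```python
import json
import urllib.request

def check_gossip(url="http://10.0.0.1:2113/gossip", max_checkpoint_lag=10_000_000):
    with urllib.request.urlopen(url, timeout=5) as resp:
        members = json.loads(resp.read().decode("utf-8")).get("members", [])

    alive = [m for m in members if m.get("isAlive")]
    masters = [m for m in alive if m.get("state") == "Master"]
    slaves = [m for m in alive if m.get("state") == "Slave"]

    # Critical: exactly one master, and (cluster size - 1) slaves.
    if len(masters) != 1:
        print("CRITICAL: expected exactly one master, found", len(masters))
    if len(slaves) != len(alive) - 1:
        print("INVESTIGATE: expected", len(alive) - 1, "slaves, found", len(slaves))

    # Investigate: a node whose writer checkpoint trails the others.
    checkpoints = [m.get("writerCheckpoint", 0) for m in alive]
    if checkpoints and max(checkpoints) - min(checkpoints) > max_checkpoint_lag:
        print("INVESTIGATE: writer checkpoints diverge by",
              max(checkpoints) - min(checkpoints), "bytes")
```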
If you are running on Windows with attached storage (in particular), the Average Disk Queue Length is definitely worth monitoring. This is not exposed via Event Store’s stats, but is available as a Windows counter.
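If it helps, a small sketch of pulling that counter on a Windows node via the typeperf CLI; the counter path and the crude CSV parsing are assumptions to adapt to your own setup:

```python
import subprocess

COUNTER = r"\PhysicalDisk(_Total)\Avg. Disk Queue Length"

def avg_disk_queue_length():
    out = subprocess.run(
        ["typeperf", COUNTER, "-sc", "1"],
        capture_output=True, text=True, check=True).stdout
    # typeperf emits CSV: a header row, then "timestamp","value" rows.
    for line in reversed(out.splitlines()):
        parts = [p.strip('"') for p in line.split('","')]
        if len(parts) == 2 and parts[1]:
            try:
                return float(parts[1])
            except ValueError:
                continue  # header or status line, not a sample
    return None
```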
If I can ask you a somewhat vague question: what are the first things you look at when you have a misbehaving cluster?
Number of slaves/masters is a great start, and I’ll take a look at the writer-checkpoint.
We currently monitor the roundtrip for a write + consume, and that’s a fantastic health check, because whenever things are going wonky, the graph changes character completely, from a predictable sawtooth to a mess.
Write->Consume is probably the best measurement you can make for general health. Once we get past that, we are starting to discuss monitoring for particular situations. The gossip above is monitoring nodes' perceived reachability etc. The writer checkpoint, though, is a good addition in case you have one node behind.
Zero load, just three t2.medium machines sitting idle.
The processing time isn't actually a mean (that's a lie): it's the value of the last processing time from stats.
I want to switch that over to the avgProcessingTime because I think I'm basically reporting on the processing time of GetFreshStats.
I also want to change the visualisation of checkpoints to make the disparity between nodes clearer. Averaging the change over 1 minute does the trick, kinda, but I might report the absolute value instead of a delta.
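For what it's worth, a rough sketch of both changes, assuming /stats nests per-queue figures under es -> queue -> <queue name> -> avgProcessingTime and that the gossip members expose writerCheckpoint; the exact paths are assumptions, so check the JSON from your own nodes first:

```python
import json
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))

def avg_processing_times(node="http://10.0.0.1:2113"):
    # Report the average rather than the last sample, so one slow
    # GetFreshStats message doesn't dominate the graph.
    queues = fetch(node + "/stats").get("es", {}).get("queue", {})
    return {name: q.get("avgProcessingTime") for name, q in queues.items()}

def writer_checkpoint_spread(node="http://10.0.0.1:2113"):
    # Plot the absolute disparity between nodes rather than a one-minute
    # delta, so a lagging node is obvious at a glance.
    members = fetch(node + "/gossip").get("members", [])
    checkpoints = [m.get("writerCheckpoint", 0) for m in members]
    return max(checkpoints) - min(checkpoints) if checkpoints else 0
```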
Hi Greg. I'm afraid I don't understand what you're asking me, but the answer is almost certainly yes in any case.
If you need more info about the metrics or the set up that I'm running, you can always email me, Bob, at made.com.
I think the spikiness is also an artifact of the reporting: it's not the average processing time but the last reported processing time, and GetFreshStats is frequently reported as a slow message.
I'll swap over to the average and send you an updated snapshot tomorrow.
Thanks a ton Bob - this was super timely, as I was trying to get collectd pulling out ES metrics just yesterday and found your message.
I’d appreciate a little more clarity from Greg on what he meant by this: “Write->Consume is probably the best measurement you can make for general health”
As well as the metrics we record directly from the servers, we have an additional metric that we gather from the consumers.
Every minute we send a heartbeat event to eventstore with a timestamp, and we consume that message, recording the difference between the send-time and consume-time.
If everything is working well, the heartbeat latency is predictable, but if the consumer is talking to a dead server, or is unusually busy, or the DNS goes wonky, or eventstore is under a lot of strain, or anything else out of the ordinary is happening, then the latency goes up.
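To make that concrete, a rough sketch of the heartbeat check over the HTTP API, assuming the AtomPub-style endpoints (POST to /streams/<stream> with ES-EventId/ES-EventType headers, then read the stream head back with embed=body); the node address and stream name are placeholders, and a real consumer-side measurement would use whatever subscription your read models already use rather than polling:

```python
import json
import time
import uuid
import urllib.request

NODE = "http://10.0.0.1:2113"          # placeholder node address
STREAM = "monitoring-heartbeat"        # placeholder stream name

def send_heartbeat():
    event_id = str(uuid.uuid4())
    body = json.dumps({"sentAt": time.time(), "id": event_id}).encode("utf-8")
    req = urllib.request.Request(
        NODE + "/streams/" + STREAM, data=body, method="POST",
        headers={"Content-Type": "application/json",
                 "ES-EventType": "heartbeat",
                 "ES-EventId": event_id})
    urllib.request.urlopen(req, timeout=5)
    return event_id, time.time()

def wait_for_consume(event_id, sent_at, timeout=30):
    # Poll the head of the stream until our event shows up; the elapsed time
    # approximates the write->consume latency discussed above.
    req = urllib.request.Request(
        NODE + "/streams/" + STREAM + "/head/backward/1?embed=body",
        headers={"Accept": "application/vnd.eventstore.atom+json"})
    deadline = time.time() + timeout
    while time.time() < deadline:
        feed = json.loads(
            urllib.request.urlopen(req, timeout=5).read().decode("utf-8"))
        for entry in feed.get("entries", []):
            if event_id in entry.get("data", ""):
                return time.time() - sent_at
        time.sleep(0.5)
    return None  # nothing consumed within the timeout: flag as critical

latency = wait_for_consume(*send_heartbeat())
print("write->consume latency (seconds):", latency)
```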
Yep, this is one way of measuring it. Basically what you are tracking is how long it takes from the time a write is issued until a consumer sees the event (another measurement here could be queue depth). Basically it's tracking how far behind a given read model etc. is.
Great, thanks both of you. That’s what I suspected. Basically it’s application-level monitoring of ES. Seems sane, and we’ve built similar things; I just wanted to clarify, as that’s clearly not info you could pull out of ES with this plugin, for example…