Collectd monitoring

Hey peoples,

We use collectd + riemann + influxdb + grafana for monitoring. After a recent spate of eventstore flakiness, I’d like to add some monitoring to our eventstore cluster. I’m assuming the simplest route is to curl the gossip and stats endpoints, but does anyone have any idea what the vital metrics are?

What things should we flag as critical, and what things should we raise for further investigation?

– Bob.

From stats? Depends what you are trying to monitor (lots of things can
be there!)

From gossip, that you have a cluster is a good start. One and only one
master, two slaves (for a 3 node cluster). Also I would watch the
writer checkpoints to see if a node is falling behind for any reason.
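
For illustration, a minimal sketch of that gossip check in Python (not taken from any plugin; the port and the member field names such as state, isAlive and writerCheckpoint are assumptions based on a 3.x node's /gossip response, so adjust for your deployment):

    import json
    import urllib.request

    def check_gossip(node="http://127.0.0.1:2113", expected_slaves=2):
        # Flag critical unless there is exactly one master and enough slaves.
        with urllib.request.urlopen(node + "/gossip") as resp:
            members = json.load(resp)["members"]

        alive = [m for m in members if m.get("isAlive")]
        masters = sum(1 for m in alive if m.get("state") == "Master")
        slaves = sum(1 for m in alive if m.get("state") == "Slave")

        # Writer checkpoint per node, to spot a node falling behind.
        checkpoints = {m.get("externalHttpIp"): m.get("writerCheckpoint")
                       for m in members}

        critical = masters != 1 or slaves < expected_slaves
        return critical, checkpoints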

If you are running on Windows and attached storage (in particular) the Average Disk Queue Length is definitely worth monitoring. This is not exposed via Event Store’s stats, but is available as a Windows counter.

If I can ask you a somewhat vague question: what are the first things you look at when you have a misbehaving cluster?

Number of slaves/masters is a great start, and I’ll take a look at the writer-checkpoint.

We currently monitor the roundtrip for a write + consume, and that’s a fantastic health check, because whenever things are going wonky, the graph changes character completely, from a predictable sawtooth to a mess.

Write->Consume is probably the best measurement you can make for general health. Once we get past that, we are starting to discuss monitoring for particular situations. The gossip above is monitoring nodes' perceived reachability etc. The writer checkpoint, though, is a good addition in case you have one node falling behind.

So I have a collectd plugin that I want to open source, and it gives me graphs that look like this: https://snapshot.raintank.io/dashboard/snapshot/5TPjRJZCMwq3jcFwdaJgexZ346t9dkOh

Before I finish it off and make it more configurable, is there anything obviously missing from this health dashboard?

-- Bob

This looks pretty good. The times seem a bit high though. What load is this under?

Zero load, just three t2.medium machines sitting idle.

The processing time isn't actually a mean, that's a lie: it's the value of the last processing time from stats.

I want to switch that over to the avgProcessingTime because I think I'm basically reporting on the processing time of GetFreshStats.
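
Something along these lines, a rough sketch rather than the plugin's current code (the /stats JSON paths are from memory of a 3.x node and may differ on your version):

    import json
    import urllib.request

    def avg_processing_times(node="http://127.0.0.1:2113"):
        # Per-queue averages, rather than whatever message happened to be
        # processed last (which is often GetFreshStats).
        with urllib.request.urlopen(node + "/stats") as resp:
            stats = json.load(resp)
        queues = stats.get("es", {}).get("queue", {})
        return {name: q.get("avgProcessingTime") for name, q in queues.items()}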

I also want to change the visualisation of checkpoints to make the disparity between nodes clearer. Averaging the change over 1 minute does the trick, kinda, but I might report the absolute value instead of a delta.
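
The idea would be something like this hypothetical helper: given the per-node writer checkpoints (for example from the gossip check sketched earlier in the thread), report the spread between the most and least advanced nodes, so a lagging node shows up as a single rising number.

    def checkpoint_spread(checkpoints):
        # checkpoints: {node: writerCheckpoint}; treat negatives as "unknown".
        values = [v for v in checkpoints.values() if v is not None and v >= 0]
        if len(values) < 2:
            return None
        return max(values) - min(values)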

Can you post? Just looking at it, it seems very spiky and the end-to-end seems fairly high.

Hi Greg. I'm afraid I don't understand what you're asking me, but the answer is almost certainly yes in any case.

If you need more info about the metrics or the set up that I'm running, you can always email me, Bob, at made.com.

I think the spikiness is also an artifact of the reporting: it's not the average processing time but the last reported processing time, and GetFreshStats is frequently reported as a slow message.

I'll swap over to the average and send you an updated snapshot tomorrow.

-- B

I got an email today reminding me that I’d offered to opensource the collectd plugin, so here it is: https://github.com/madedotcom/eventstore-collectd

We’ve been running this for a few months now, and it’s been both soothing and aesthetically pleasing.

– B

Nice, will check it out.

Thanks a ton Bob - this was super timely, as I was trying to get collectd pulling out ES metrics just yesterday and found your message.

I’d appreciate a bit more clarity from Greg about what this meant: “Write->Consume is probably the best measurement you can make for general health”

I can answer that.

As well as the metrics we record directly from the servers, we have an additional metric that we gather from the consumers.

Every minute we send a heartbeat event to eventstore with a timestamp, and we consume that message, recording the difference between the send-time and consume-time.

If everything is working well, the heartbeat latency is predictable, but if the consumer is talking to a dead server, or is unusually busy, or the DNS goes wonky, or eventstore is under a lot of strain, or anything else out of the ordinary is happening, then the latency goes up.
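
A minimal sketch of the check, not our production code; the stream name and port are placeholders, the write goes through the Atom HTTP API, and in practice the consuming side would be whatever subscription your consumers already use:

    import json
    import time
    import uuid
    import urllib.request

    NODE = "http://127.0.0.1:2113"
    STREAM = "monitoring-heartbeats"

    def send_heartbeat():
        # Append a heartbeat event carrying the send time.
        body = json.dumps([{
            "eventId": str(uuid.uuid4()),
            "eventType": "heartbeat",
            "data": {"sentAt": time.time()},
        }]).encode("utf-8")
        req = urllib.request.Request(
            NODE + "/streams/" + STREAM,
            data=body,
            headers={"Content-Type": "application/vnd.eventstore.events+json"})
        urllib.request.urlopen(req).close()

    def heartbeat_latency():
        # Read the newest event back (field layout assumed from the Atom
        # feed with embed=body) and compare its send time with now.
        req = urllib.request.Request(
            NODE + "/streams/" + STREAM + "/head/backward/1?embed=body",
            headers={"Accept": "application/json"})
        with urllib.request.urlopen(req) as resp:
            feed = json.load(resp)
        sent_at = json.loads(feed["entries"][0]["data"])["sentAt"]
        return time.time() - sent_at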

Yep, this is one way of measuring it. Basically what you are tracking is how long it takes from the time a write is issued until a consumer sees the event (another measurement here could be queue depth). Basically it's tracking how far behind a given read model etc. is.

Great, thanks both of you. That’s what I suspected. Basically it’s application-level monitoring of ES. Seems sane, and we’ve built similar things. I just wanted to clarify, as that’s clearly not info you could pull out of ES with this plugin, for example…

eg. https://snapshot.raintank.io/dashboard/snapshot/yhYW1G3dKsPiPfoDjkOuAUZ6r7NnApmi?panelId=1&fullscreen

Spot the outage.