Measuring node/cluster load & projection lag

Hello there,

We are running some load tests on our brand new platform using ES, and we are seeing something new: the projections can lag.

It probably won’t be a surprise for those who have already thought about it: under heavy load the projection “Done” percentage drops below 100% and the projection starts accumulating delay. That’s good, it’s a very good design: it allows ES to keep ingesting more data and to delay what can be delayed.

But once the delay reaches minutes, our process managers fall behind as well. They have business timeouts, so after waiting a while for an event that has been written but is not yet visible in the projections, they crash the process.

After this long introduction, here is the question: how can we monitor this delay properly?

I have an idea:

=> send heartbeats to a stream

=> add a projection over $et-
- it creates a new event in another stream: MyProjectionDelayMeter { heartbeatTimestamp ; projectionstamp }

=> and then I “just” have to graph these MyProjectionDelayMeter events in any reporting tool.
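To make the idea concrete, here is a minimal sketch of the reporting side only, assuming each MyProjectionDelayMeter event has already been read into a dict and that both fields hold ISO-8601 timestamps (the timestamp format and the helper names are my assumptions; the sample values are made up for illustration):

```python
from datetime import datetime


def projection_delay_seconds(meter: dict) -> float:
    """Seconds between the heartbeat being sent and the projection handling it."""
    # Assumption: both fields are ISO-8601 strings with a UTC offset.
    sent = datetime.fromisoformat(meter["heartbeatTimestamp"])
    seen = datetime.fromisoformat(meter["projectionstamp"])
    return (seen - sent).total_seconds()


def worst_delay(meters: list) -> float:
    """Largest delay in a batch of meter events, or 0.0 for an empty batch."""
    return max((projection_delay_seconds(m) for m in meters), default=0.0)


# Made-up sample events, only to show the shape of the data.
sample = [
    {"heartbeatTimestamp": "2015-03-01T10:00:00+00:00",
     "projectionstamp": "2015-03-01T10:00:02+00:00"},
    {"heartbeatTimestamp": "2015-03-01T10:00:10+00:00",
     "projectionstamp": "2015-03-01T10:01:40+00:00"},
]

if __name__ == "__main__":
    for m in sample:
        print(projection_delay_seconds(m))
    print("worst:", worst_delay(sample))
```

Comparing the worst delay against the process managers’ business timeout would tell us how close we are to the point where they start crashing.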

I’ll also need the load of the node at the same time. I assume I can project some of the metrics from $stats-NODEIP into the same reporting, so we can know precisely what our limits are.

So folks, am I on the right track, or is there a simpler way to do that?

Perhaps it’s already in the $stats stream?

Perhaps someone has already made a live plot of this stream?

You can get this information over the projections HTTP API (the same thing the UI uses).
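As a rough sketch of what polling that API could look like: the endpoint path (`/projection/{name}/statistics`), the `projections` array and the `progress` / `bufferedEvents` field names below are written from memory and should be treated as assumptions to check against your ES version. Authentication is left out, and the function names are hypothetical.

```python
import json
from urllib.request import Request, urlopen


def fetch_statistics(base_url: str, projection_name: str) -> dict:
    """Fetch the statistics document for one projection (assumed endpoint)."""
    # Assumption: the node serves JSON at /projection/{name}/statistics.
    req = Request(
        f"{base_url}/projection/{projection_name}/statistics",
        headers={"Accept": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)


def summarize(stats: dict) -> list:
    """Keep only the fields that are interesting for a lag graph."""
    rows = []
    for p in stats.get("projections", []):
        rows.append({
            "name": p.get("name"),
            "status": p.get("status"),
            "progress": p.get("progress"),  # assumed: percentage, 100.0 when caught up
            "bufferedEvents": p.get("bufferedEvents"),
        })
    return rows


def lagging(stats: dict, threshold: float = 100.0) -> list:
    """Names of projections whose reported progress is below the threshold."""
    return [
        row["name"]
        for row in summarize(stats)
        if row["progress"] is not None and row["progress"] < threshold
    ]
```

Calling `lagging(fetch_statistics(...))` on a timer and pushing the result into whatever reporting tool you already use would give the graph without adding a heartbeat stream, if the reported progress turns out to be precise enough for your timeouts.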