I’ve been looking at performance on our Eventstore cluster (4.0.3), and one thing that stands out in the logs is warnings from the monitoring queue, e.g.:
[00001,61,12:14:56.603] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1094ms. Q: 0/0.
[00001,108,12:14:59.976] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1482ms. Q: 0/0.
[00001,26,12:15:02.853] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1318ms. Q: 0/0.
[00001,12,12:15:08.748] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1213ms. Q: 0/0.
[00001,10,12:15:15.655] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1100ms. Q: 0/0.
[00001,50,12:15:20.699] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1166ms. Q: 0/0.
While we have Eventstore set to record stats at the default interval of 30 seconds, we also have Prometheus scraping metrics from the /stats HTTP endpoint, and it turns out this operation appears to be more expensive than expected.
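To put numbers on this, here is a minimal sketch that pulls the GetFreshStats durations out of log lines in the format shown above. The regex and the helper name slow_durations are illustrative assumptions based only on those sample lines, not anything provided by Eventstore.

```python
import re
from statistics import mean

# Assumed line format, taken from the log excerpt above:
# [00001,61,12:14:56.603] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1094ms. Q: 0/0.
SLOW_MSG = re.compile(
    r"\[\d+,\d+,(?P<time>[\d:.]+)\] SLOW QUEUE MSG \[(?P<queue>\w+)\]: "
    r"(?P<msg>\w+) - (?P<ms>\d+)ms\."
)


def slow_durations(lines, queue="MonitoringQueue", msg="GetFreshStats"):
    """Return the durations (in ms) of slow messages for one queue and message type."""
    durations = []
    for line in lines:
        match = SLOW_MSG.search(line)
        if match and match["queue"] == queue and match["msg"] == msg:
            durations.append(int(match["ms"]))
    return durations


# The six warnings quoted above, used as sample input.
sample = [
    "[00001,61,12:14:56.603] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1094ms. Q: 0/0.",
    "[00001,108,12:14:59.976] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1482ms. Q: 0/0.",
    "[00001,26,12:15:02.853] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1318ms. Q: 0/0.",
    "[00001,12,12:15:08.748] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1213ms. Q: 0/0.",
    "[00001,10,12:15:15.655] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1100ms. Q: 0/0.",
    "[00001,50,12:15:20.699] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 1166ms. Q: 0/0.",
]

durations = slow_durations(sample)
print(f"count={len(durations)} mean={mean(durations):.0f}ms max={max(durations)}ms")
```

Running the same function over a full log file (for example, slow_durations(open(path)) with your own path) would show whether the slow calls line up with the Prometheus scrape interval.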
Is collecting stats expected to be an expensive operation? Would this be CPU bound or IO bound? Is it possible that calculating stats may have an impact on other ongoing operations?
Cheers,
Kristian