We are running EventStore 5.0.2 on an AKS cluster (Helm chart) with 3 Ubuntu nodes. We port-forward to each pod and start the scavenge from the Admin UI. The first time we run a scavenge everything goes fine and it finishes successfully on all pods. If we try to run a scavenge again, it fails every time on every pod: it runs fine until it reaches a specific chunk, then the pod is recreated and the scavenge is left in an unfinished state. We've monitored what is going on in the AKS UI and we can see the pod getting the error "Liveness probe failed … connection refused" before it is recreated, but we can't see why. Are we missing something or doing something wrong?
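Roughly what we do, in case it helps reproduce it (the namespace and pod names below are just examples from our setup, and we trigger the scavenge via the HTTP admin endpoint, which is what the Admin UI button calls):

```sh
# Port-forward to one EventStore pod (namespace/pod name are examples)
kubectl -n eventstore port-forward pod/eventstore-0 2113:2113

# Start a scavenge through the HTTP admin API (default admin credentials)
curl -i -X POST -u admin:changeit http://localhost:2113/admin/scavenge

# After the pod is recreated, this is what we look at:
kubectl -n eventstore describe pod eventstore-0            # probe failures, restart reason
kubectl -n eventstore get events --sort-by=.lastTimestamp  # recent cluster events
kubectl -n eventstore logs eventstore-0 --previous         # logs from the killed container
```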
I doubt that is the case, because the cluster is used only for testing the scavenge. There is no difference in load between the first run, which succeeds, and all the consecutive runs, which fail on the same chunk each time. By the way, the first time I executed the scavenge I started it on all 3 pods at the same time and it finished successfully. Every time after that I executed the scavenge on only 1 pod and it failed.
Scavenging a node is an IO-intensive operation, so I suspect that @pconnolly is right: the pod doesn't get enough resources and is unable to reply to the liveness probe.
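If the chart you're using exposes them, it's worth raising the resource requests/limits and relaxing the liveness probe while you test. A rough sketch of what I mean, assuming the chart has `resources` and `livenessProbe` values (the chart reference and key names below are assumptions, so check `helm show values` for the actual keys in your chart):

```sh
# See what the chart actually exposes before overriding anything
helm show values eventstore/eventstore > values.yaml

# Example override (release name, chart name and value keys are assumptions)
helm upgrade eventstore eventstore/eventstore \
  --set resources.requests.cpu=1 \
  --set resources.requests.memory=2Gi \
  --set resources.limits.cpu=2 \
  --set resources.limits.memory=4Gi \
  --set livenessProbe.periodSeconds=30 \
  --set livenessProbe.failureThreshold=10

# Watch pod resource usage while the scavenge runs (requires metrics-server)
kubectl -n eventstore top pod
```

If `kubectl top pod` shows the pod pegged at its CPU limit while the scavenge is running, throttling is very likely what's making it miss the probe.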