Best practice for scavenge / triggering failover via HTTP/API

Hi,

I am looking to automate scavenging across a 3-node cluster. Our installation is relatively large, so running a full scavenge will take many hours - close to a day - per node.

The plan was to run this in a staggered way - e.g. node 1 on Monday, node 2 on Wednesday, node 3 on Friday, or similar - to make sure we don't bog down the cluster as a whole.
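For context, the trigger itself would be something along these lines - a minimal sketch assuming the standard POST /admin/scavenge endpoint and ops credentials; the node addresses and the schedule are just placeholders:

```python
# Sketch: stagger scavenges across the cluster, one node per chosen weekday.
# Assumes EventStoreDB's POST /admin/scavenge admin endpoint and ops
# credentials; node addresses, credentials and the schedule are placeholders.
import datetime
import requests

AUTH = ("ops", "changeit")  # placeholder credentials

# Placeholder schedule: weekday (0 = Monday) -> node to scavenge that day.
SCHEDULE = {
    0: "https://node1.example.internal:2113",  # Monday
    2: "https://node2.example.internal:2113",  # Wednesday
    4: "https://node3.example.internal:2113",  # Friday
}

def start_scavenge(node: str) -> str:
    """Start a scavenge on the given node and return the reported scavenge id."""
    resp = requests.post(f"{node}/admin/scavenge", auth=AUTH)
    resp.raise_for_status()
    return resp.json()["scavengeId"]

if __name__ == "__main__":
    node = SCHEDULE.get(datetime.date.today().weekday())
    if node:
        print(f"started scavenge {start_scavenge(node)} on {node}")
```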

  1. Does this approach make sense (or is it better to “pull the bandaid” across the full cluster at once)?
  2. If so, it would seem reasonable to run the scavenge operation on a non-master node (as our writers connect directly to the master node, so it is under more load).
  3. If so - is there a way to trigger failover, i.e. make sure that a particular node is not the master node, via the API or some HTTP endpoint (other than brutally bringing it down)?

Another approach we have considered, though I hope we will not need it, is to keep 4 nodes - but only 3 in active use at any point - and run the scavenge on the node that is temporarily taken out of the cluster, though that node would then need to catch up again when re-joining.

Cheers,
Kristian

Since 20.6 there is a way to force a leader to resign: lower the node's priority via the HTTP admin API and then tell the current leader to resign, so the elections pick another node.

You should reset the priority back to the default of 0 afterwards.
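Roughly like this - a minimal sketch, assuming the /admin/node/priority/{priority} and /admin/node/resign admin endpoints from 20.6 with ops credentials; the node address is a placeholder and -1 is just an example value lower than the default:

```python
# Sketch: make sure a specific node is not the leader before scavenging it.
# Uses the admin endpoints available since 20.6:
#   POST /admin/node/priority/{priority}  - change the node's election priority
#   POST /admin/node/resign               - ask the current leader to resign
# Node address, credentials and the -1 priority value are placeholders.
import requests

NODE = "https://node1.example.internal:2113"  # node you want to scavenge
AUTH = ("ops", "changeit")                    # placeholder credentials

def ensure_not_leader(node: str) -> None:
    """Lower the node's priority, then ask it to resign if it is the leader."""
    requests.post(f"{node}/admin/node/priority/-1", auth=AUTH).raise_for_status()
    # Resigning only applies to the current leader; on a follower this call
    # is expected to be rejected, which is fine for our purpose.
    requests.post(f"{node}/admin/node/resign", auth=AUTH)

def restore_priority(node: str) -> None:
    """Reset the priority back to the default of 0 once the scavenge is done."""
    requests.post(f"{node}/admin/node/priority/0", auth=AUTH).raise_for_status()
```

After the scavenge completes, call restore_priority so the node can take part in elections normally again.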

> Our installation is relatively large, so running a full scavenge will take many hours - close to a day - per node.

How large is it?