Commit timeouts causing projection faults

Tyler_Gregory · February 20, 2020, 3:24pm

Hello all,

We had a scavenge operation take several days on a four-node cluster (ClusterSize: 3; one node is a Clone). It failed on two nodes, succeeded on one, and was stopped manually on the last. During the scavenge there were several elections, and just after the final election the same two nodes that had marked their scavenge as failed took themselves offline to perform offline truncation. After they came back up, their scavenge processes failed due to a commit timeout on the scavenge stream. Since then, we have been intermittently experiencing projection faults due to commit timeouts, even though cluster resource utilization is light (~15% CPU utilization, <1.5k IOPS). Has anyone experienced similar symptoms when running a scavenge across all nodes in a cluster?

Environment info:

EventStore version: oss-5.0.5

OS: three Ubuntu 18.04 nodes and one Windows node

Using the protobuf API over SSL

eventstore.conf: