Hi,
I had a cluster (cluster size 2, commit count 1) where, during a scavenge, clients failed to connect to Event Store for several hours.
(I’ve attached a log containing everything)
Here is a snippet of the log:
PID:01164:013 2019.02.25 11:07:58.171 ERROR StorageScavenger Failed to write the $maxAge of 30 days metadata for the $scavenges stream. Reason: CommitTimeout
It seems clients were unable to connect:
[PID:01164:034 2019.02.25 11:19:26.483 INFO TcpConnection ] ES TcpConnection closed [11:19:26.484: N172.38.52.94:49764, L172.38.71.144:1113, {e4742810-1925-484a-b812-d36a92026bdb}]:Close reason: [ConnectionReset] Socket receive error
**One of the instances came up as DEAD:**
[PID:01178:014 2019.02.25 11:07:52.207 TRACE GossipServiceBase ] VND {46b987ad-2b20-452b-aebd-26a40bbda635} <DEAD> [Master, 172.38.71.144:1114, n/a, 172.38.71.144:1113, n/a, 172.38.71.144:2113, 172.38.71.144:2114] 13157216444/13157217319/13157217319/E177@13149315809:{8585f9e4-7441-4811-9601-5e877f88716d} | 2019-02-25 11:07:52.196
Additionally, it seems an election occurred after the StorageScavenger write failed (see attached log).
Would increasing the heartbeat timeouts help reduce elections?
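(I'm assuming the relevant knobs are the TCP heartbeat and gossip timeouts in eventstore.conf; the values below are only an illustration of what I might try, not our current settings or a recommendation:)

IntTcpHeartbeatTimeout: 2000    # internal (cluster) heartbeat timeout, in ms
ExtTcpHeartbeatTimeout: 2000    # external (client) heartbeat timeout, in ms
GossipTimeoutMs: 2500           # how long a node waits for a gossip reply, in ms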
Does StorageScavenger result in clients being unable to connect?
Any advice?
Best,
Gabriel