Hi,
I had a cluster (cluster size 2, commit count 1) where, during a scavenge, clients failed to connect to EventStore for several hours.
(I've attached a log containing everything.)
Here is a snippet of the log:
PID:01164:013 2019.02.25 11:07:58.171 ERROR StorageScavenger Failed to write the $maxAge of 30 days metadata for the $scavenges stream. Reason: CommitTimeout
Host: 172.38.71.144 Name: /var/log/eventstore/2019-02-25/172.38.71.144-2114-cluster-node.log
It seems like clients were unable to connect:
[PID:01164:034 2019.02.25 11:19:26.483 INFO TcpConnection ] ES TcpConnection closed [11:19:26.484: N172.38.52.94:49764, L172.38.71.144:1113, {e4742810-1925-484a-b812-d36a92026bdb}]:Close reason: [ConnectionReset] Socket receive error
**One of the instances comes up as DEAD:**
[PID:01178:014 2019.02.25 11:07:52.207 TRACE GossipServiceBase ] VND {46b987ad-2b20-452b-aebd-26a40bbda635} <DEAD> [Master, 172.38.71.144:1114, n/a, 172.38.71.144:1113, n/a, 172.38.71.144:2113, 172.38.71.144:2114] 13157216444/13157217319/13157217319/E177@13149315809:{8585f9e4-7441-4811-9601-5e877f88716d} | 2019-02-25 11:07:52.196
Additionally, it seems like an election occurred after the StorageScavenger failure (see attached log).
Would increasing the heartbeat timeout help reduce elections?
Does StorageScavenger result in clients being unable to connect?
Any advice?
Best,
Gabriel
search-results-2019-02-27T05_34_13.307-0800.csv (135 KB)
Hi all,
bumping this thread - anything I can add to improve my question?
You should be able to scavenge and be actively running at the same time … There are a few things here.
- What type of machines/environment are you running with?
- You never want to run with a cluster size of 2 and a commit count of 1 (you want to use quorums). I believe you will find in the logs that it's ignoring the 1 and using 2 commits. 1/2 is not a safe environment (it can cause data loss). 2/3 is what most would run with.
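For illustration, a minimal sketch of one node's config for a 3-node, quorum-based cluster (the YAML keys are standard EventStore node options; the IPs and DNS name are placeholders, not values from this thread):

```yaml
# Sketch: one node of a 3-node cluster using quorums.
# With ClusterSize: 3 the commit/prepare quorum is 2 of 3 nodes,
# so there is no need to force CommitCount down to 1.
ClusterSize: 3
DiscoverViaDns: true
ClusterDns: escluster.internal.example.com   # placeholder DNS name
IntIp: 10.0.0.11                             # placeholder internal IP, unique per node
ExtIp: 10.0.0.11
```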
Cheers,
Greg
Hi Greg,
Thanks for the thorough response, it is appreciated.
- AWS t2.small instances, 30 GB EBS storage
- We are running 2 instances; this is a non-critical dev/test env
When the cluster became unavailable there was a high volume of messages in the logs; that would cause the cluster to become unavailable, correct?
https://eventstore.org/docs/server/ports-and-networking/index.html#heartbeat-timeouts
What settings can we adjust to alleviate this?
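Based on that page, I'm guessing it would be the TCP heartbeat interval/timeout settings, something like the sketch below (the values are just my guess at raised numbers, not recommendations from the docs):

```yaml
# Sketch: raising internal/external TCP heartbeat settings (values in ms, chosen arbitrarily).
IntTcpHeartbeatInterval: 700
IntTcpHeartbeatTimeout: 2000
ExtTcpHeartbeatInterval: 2000
ExtTcpHeartbeatTimeout: 5000
```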
-thanks
What exact messages were you getting?
I attached a partial log from when clients were unable to connect to my first post (I think it's basically just cycling through DEAD messages / failed scavenges / elections).
[PID:01164:013 2019.02.25 11:07:49.989 TRACE GossipServiceBase ] Looks like node [172.38.73.62:1114] is DEAD (TCP connection lost).
[PID:01164:012 2019.02.25 11:07:49.985 TRACE TcpConnectionManager] Closing connection 'internal-normal' [172.38.73.62:55492, L172.38.71.144:1114, {4bfcad75-1ab2-4694-b5fa-d4d3397e6d77}] cleanly. Reason: Closing replication subscription connection.
.
.
.
[PID:01178:014 2019.02.25 11:07:49.653 TRACE GossipServiceBase ] Looks like master [172.38.71.144:2113, {46b987ad-2b20-452b-aebd-26a40bbda635}] is DEAD (Gossip send failed), though we wait for TCP to decide.
We never replaced instances or changed IPs; eventually the cluster recovered on its own after several hours.