EventStore becomes unavailable during Scavenge

Hi,

I had a cluster (size of 2, commit count of 1) where, during a scavenge, clients failed to connect to EventStore for several hours.

(I’ve attached a log containing everything)

Here is a snippet of the log:

PID:01164:013 2019.02.25 11:07:58.171 ERROR StorageScavenger Failed to write the $maxAge of 30 days metadata for the $scavenges stream. Reason: CommitTimeout
Host: 172.38.71.144 Name: /var/log/eventstore/2019-02-25/172.38.71.144-2114-cluster-node.log

It seems like clients were unable to connect:

[PID:01164:034 2019.02.25 11:19:26.483 INFO  TcpConnection       ] ES TcpConnection closed [11:19:26.484: N172.38.52.94:49764, L172.38.71.144:1113, {e4742810-1925-484a-b812-d36a92026bdb}]:Close reason: [ConnectionReset] Socket receive error


**One of the instances came up as DEAD:**
[PID:01178:014 2019.02.25 11:07:52.207 TRACE GossipServiceBase   ] VND {46b987ad-2b20-452b-aebd-26a40bbda635} <DEAD> [Master, 172.38.71.144:1114, n/a, 172.38.71.144:1113, n/a, 172.38.71.144:2113, 172.38.71.144:2114] 13157216444/13157217319/13157217319/E177@13149315809:{8585f9e4-7441-4811-9601-5e877f88716d} | 2019-02-25 11:07:52.196



Additionally, it seems like an election occurred after the StorageScavenger failure (see attached log).

Would increasing the heartbeat timeouts help reduce elections?
Does StorageScavenger result in clients being unable to connect?

Any advice?


Best,
Gabriel

search-results-2019-02-27T05_34_13.307-0800.csv (135 KB)

Hi all,

Bumping this thread - is there anything I can add to improve my question?

You should be able to scavenge and be actively running at the same time … There are a few things here.

  1. What type of machines/environment are you running with?

  2. You never want to run with a cluster size of 2 and a commit count of 1 (you want to use quorums). I believe you will find in the logs that it's ignoring the 1 and using 2 commits. 1/2 is not a safe configuration (it can cause data loss). 2/3 is what most would run with (see the sketch below).
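
For example, a minimal sketch of a quorum setup in eventstore.conf (assuming the YAML-style config file used by the 4.x/5.x servers; check the option names against the docs for your version):

    ClusterSize: 3
    # With ClusterSize: 3 the server works out quorum writes itself
    # (a majority, i.e. 2 of 3 nodes, must acknowledge), so there is
    # no need to force CommitCount/PrepareCount down to 1.
    DiscoverViaDns: true
    ClusterDns: escluster.internal.example   # hypothetical DNS name for node discovery

The point is to let the cluster size drive the quorum rather than lowering the commit count below a majority.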

Cheers,

Greg

Hi Greg,

Thanks for the thorough response, it is appreciated.

  1. AWS t2.small instances, 30 GB EBS storage
  2. We are running 2 instances; this is a non-critical dev/test environment

When the cluster became unavailable there was a high volume of messages in the logs; that churn would cause the cluster to become unavailable, correct?

https://eventstore.org/docs/server/ports-and-networking/index.html#heartbeat-timeouts

What setting can we adjust to alleviate this? We're guessing it's the heartbeat/gossip timeouts from that page - something like the sketch below?
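
This is roughly the change we had in mind - a sketch only, assuming the YAML-style eventstore.conf, so the option names and values should be double-checked against the docs for the version we run (all values are milliseconds):

    # Internal (node-to-node) heartbeats - raise these if replication
    # heartbeats time out while the node is busy.
    IntTcpHeartbeatInterval: 2000
    IntTcpHeartbeatTimeout: 5000
    # External (client) heartbeats - raise these if client connections
    # are dropped with heartbeat timeouts.
    ExtTcpHeartbeatInterval: 2000
    ExtTcpHeartbeatTimeout: 5000
    # Gossip timing - a longer timeout makes nodes slower to declare
    # each other DEAD and so less likely to trigger spurious elections.
    GossipIntervalMs: 2000
    GossipTimeoutMs: 5000

The trade-off being that genuinely dead nodes are detected more slowly.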

-thanks

What exact messages were you getting?

I attached a partial log from when clients were unable to connect to my first post (I think it's basically just cycling through DEAD messages / failed scavenges / elections).

[PID:01164:013 2019.02.25 11:07:49.989 TRACE GossipServiceBase ] Looks like node [172.38.73.62:1114] is DEAD (TCP connection lost).

[PID:01164:012 2019.02.25 11:07:49.985 TRACE TcpConnectionManager] Closing connection ‘internal-normal’ [172.38.73.62:55492, L172.38.71.144:1114, {4bfcad75-1ab2-4694-b5fa-d4d3397e6d77}] cleanly. Reason: Closing replication subscription connection.
...

[PID:01178:014 2019.02.25 11:07:49.653 TRACE GossipServiceBase ] Looks like master [172.38.71.144:2113, {46b987ad-2b20-452b-aebd-26a40bbda635}] is DEAD (Gossip send failed), though we wait for TCP to decide.

We never replaced instances or changed IPs; eventually the cluster recovered on its own after several hours.

Thanks