Lost events during leader election/truncation

Hi! We have a situation where we have lost a few events while the master node was shutting down/restarting.

We have a 3 node cluster with a quorum of 2
The master node was also doing scavenging at the time.
The other nodes discovered the master was dead and started leader election. Another node was proposed as master, but this was rejected by the other node, as it had now discovered the old master was still alive.
Then the master node was found dead again, and election started again. This time a new master was elected. But the log shows gossip still saying MASTER as status for the old master node (so aparently we have 2 masters?)

then the old master discovers that the new masters write checkpoint is less than it’s own, and that truncation is needed. It then starts so shut down

Sometime during all this, one of our servicess successfully wrote 2 events to the eventstore. But later we found that these events where not in Eventstore. How could this happen? The events should have been written to at least two nodes, right? So even if the events where truncated on the old master, at least one of the other nodes should have them?

We are running Eventstore 5.0.8 on ubuntu 18.04.5, and we are using esjc 2.3.0 for connecting to Eventstore

Cheers,
Morten

1 Like

Here are some of the logs from the leader election:
(the missing events should have been written at 14:09:46.807):

Sep 6, 2021 @ 14:09:22.390  SCAVENGING: started to scavenge chunks: #2143-2144 (chunk-002143.000002) 2143 => 2144 (575257182208 => 575794053120)    tl-sxd8e-026.i04.local

Sep 6, 2021 @ 14:09:44.742  There is NO MASTER or MASTER is DEAD according to GOSSIP. Starting new elections. MASTER: [InstanceId: {8478a738-32dc-45e6-b15c-fe913ab5039c}, InternalTcp: 10.241.12.9:1112, InternalSecureTcp: 10.241.12.9:1114, ExternalTcp: 10.241.12.9:1113, ExternalSecureTcp: 10.241.12.9:1115, InternalHttp: 10.241.12.9:2112, ExternalHttp: 10.241.12.9:2113]. tl-sxd8e-026.i04.local

Sep 6, 2021 @ 14:09:44.600  Closing connection 'master-secure' [10.241.12.9:1114, L10.241.12.8:54012, {e0921100-da82-4a20-9b07-32ff1c6a5303}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 3715280  tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:44.606  ES TcpConnectionSsl closed [12:09:44.606: N10.241.12.9:1114, L10.241.12.8:54012, {e0921100-da82-4a20-9b07-32ff1c6a5303}]:Received bytes: 2473830252, Sent bytes: 236696126  tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:44.606  ES TcpConnectionSsl closed [12:09:44.607: N10.241.12.9:1114, L10.241.12.8:54012, {e0921100-da82-4a20-9b07-32ff1c6a5303}]:Send calls: 4409114, callbacks: 4409114    tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:44.606  ES TcpConnectionSsl closed [12:09:44.607: N10.241.12.9:1114, L10.241.12.8:54012, {e0921100-da82-4a20-9b07-32ff1c6a5303}]:Receive calls: 3715309, callbacks: 3715308 tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:44.606  ES TcpConnectionSsl closed [12:09:44.607: N10.241.12.9:1114, L10.241.12.8:54012, {e0921100-da82-4a20-9b07-32ff1c6a5303}]:Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 3715280    tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:44.739  Connection 'master-secure' [10.241.12.9:1114, {e0921100-da82-4a20-9b07-32ff1c6a5303}] closed: Success.  tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:44.739  Looks like node [10.241.12.9:1114] is DEAD (TCP connection lost).   tl-sxd8e-026.i04.local

Sep 6, 2021 @ 14:09:44.864  ELECTIONS: (V=1234) SENDING PROPOSAL CANDIDATE: [10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}](L=744196147866,W=744196148509,C=744196148509,E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}), ME: [10.241.12.8:2112,{2608d10c-87d5-4a15-8511-21408090e27f}](L=744196147866,W=744196148509,C=744196148509,E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}).   tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:44.864  ELECTIONS: (V=1234) ACCEPT FROM [10.241.12.8:2112,{2608d10c-87d5-4a15-8511-21408090e27f}] M=[10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}]). tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:44.902  ELECTIONS: (V=1234) NOT LEGITIMATE MASTER PROPOSAL FROM [10.241.12.8:2112,{2608d10c-87d5-4a15-8511-21408090e27f}] M=[10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}](L=744196147866,W=744196148509,C=744196148509,E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}). PREVIOUS MASTER IS ALIVE: [10.241.12.9:2112,{8478a738-32dc-45e6-b15c-fe913ab5039c}]. tl-sxd8e-025.i04.local
Sep 6, 2021 @ 14:09:45.007  Closing connection 'master-secure' [10.241.12.9:1114, L10.241.12.7:47514, {dbf1c28b-6a20-49a5-92ae-b0989dc2a489}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 2112815  tl-sxd8e-025.i04.local

Sep 6, 2021 @ 14:09:45.068  There is NO MASTER or MASTER is DEAD according to GOSSIP. Starting new elections. MASTER: [InstanceId: {8478a738-32dc-45e6-b15c-fe913ab5039c}, InternalTcp: 10.241.12.9:1112, InternalSecureTcp: 10.241.12.9:1114, ExternalTcp: 10.241.12.9:1113, ExternalSecureTcp: 10.241.12.9:1115, InternalHttp: 10.241.12.9:2112, ExternalHttp: 10.241.12.9:2113]. tl-sxd8e-025.i04.local

Sep 6, 2021 @ 14:09:45.146  ELECTIONS: (V=1235) SENDING PROPOSAL CANDIDATE: [10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}](L=744196147866,W=744196148509,C=744196148509,E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}), ME: [10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}](L=744196147866,W=744196148509,C=744196148509,E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}).   tl-sxd8e-025.i04.local
Sep 6, 2021 @ 14:09:45.146  ELECTIONS: (V=1235) ACCEPT FROM [10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}] M=[10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}]). tl-sxd8e-025.i04.local
Sep 6, 2021 @ 14:09:45.150  ELECTIONS: (V=1235) PROPOSAL FROM [10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}] M=[10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}](L=744196147866,W=744196148509,C=744196148509,E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}). ME=[10.241.12.8:2112,{2608d10c-87d5-4a15-8511-21408090e27f}](L=744196147866,W=744196148509,C=744196148509,E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}).  tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:45.150  ELECTIONS: (V=1235) ACCEPT FROM [10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}] M=[10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}]). tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:45.150  ELECTIONS: (V=1235) ACCEPT FROM [10.241.12.8:2112,{2608d10c-87d5-4a15-8511-21408090e27f}] M=[10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}]). tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:45.150  ELECTIONS: (V=1235) DONE. ELECTED MASTER = [10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}](L=744196147866,W=744196148509,C=744196148509,E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}). ME=[10.241.12.8:2112,{2608d10c-87d5-4a15-8511-21408090e27f}](L=744196147866,W=744196148509,C=744196148509,E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}). tl-sxd8e-026.i04.local

Sep 6, 2021 @ 14:09:45.150  ========== [10.241.12.8:2112] PRE-REPLICA STATE, WAITING FOR CHASER TO CATCH UP... MASTER IS [10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}]  tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:45.150  Persistent subscriptions received state change to PreReplica. Stopping listening    tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:45.150  Closing connection 'master-secure' [10.241.12.9:1114, L10.241.12.8:54012, {e0921100-da82-4a20-9b07-32ff1c6a5303}] cleanly. Reason: Reconnecting from old master [10.241.12.9:1114] to new master: [10.241.12.7:1114].   tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:45.155  Subscribing at LogPosition: 744196148509 (0xAD458C0D1D) to MASTER [10.241.12.7:1114, {fa2209bb-ccca-406a-8d60-566cef166fc2}] as replica with SubscriptionId: {354c7b2a-5064-458d-b5b5-a443299c5af0}, ConnectionId: {eeee6d9d-4215-4047-b1ec-c28a43c78803}, LocalEndPoint: [10.241.12.8:46522], Epochs:
E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}
E3557@742002186898:{e8a3fe81-97f1-46d5-95d9-27cc5bcfd548}
E3556@742002177317:{a5a5863c-a82e-458b-a83f-a39cda6ed6c9}
E3555@741991578400:{597bf6bb-3682-47eb-a283-4637bc712f02}
E3554@741991578166:{3f7139b3-a644-4453-afda-2914b35c105c}
E3553@737321788570:{fee93d8e-6772-4280-9059-57fe6b0d8361}
E3552@737302591716:{8b03b7f9-c71a-44b7-a337-00421df8476a}
E3551@737302591482:{282e146b-8f95-446a-8d27-c46cecca18d1}
E3550@737302590047:{5e3e05f4-f422-493e-84af-2bec3247643d}
E3549@737302583014:{7c8be6ec-d18e-491a-9807-0ad8ba6594a9}
E3548@737302456874:{e0c7401e-594d-4d8c-b2b7-6581f5fcbf42}
E3547@735666499020:{ece4bbef-89ed-4b27-8cb2-dd6cbcdf9ddf}
E3546@735665292419:{4cb2e4f6-2da7-4722-8f68-84c2c43a0b89}
E3545@735665292185:{171d1b89-3420-4583-a82e-b359f4a19e77}
E3544@735665291951:{e95dd5ab-4cdd-4745-a6a5-d42897e61ae4}
E3543@733421374889:{b9daa2b5-f4be-41b0-b4bf-25f1846179a6}
E3542@733392268096:{385c44bc-e751-4c5f-86df-299b71dd7586}
E3541@729365403177:{5df6ca01-431e-42fd-ad9d-5880423b0aff}
E3540@729362926368:{b2a14c3a-542e-45b2-a5f9-4956c72aa874}
E3539@729362926134:{ec161bc0-9203-40ca-9a70-5d19999300cb}...
.   tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:45.157  ELECTIONS: (V=1235) ACCEPT FROM [10.241.12.8:2112,{2608d10c-87d5-4a15-8511-21408090e27f}] M=[10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}]). tl-sxd8e-025.i04.local
Sep 6, 2021 @ 14:09:45.157  ELECTIONS: (V=1235) DONE. ELECTED MASTER = [10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}](L=744196147866,W=744196148509,C=744196148509,E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}). ME=[10.241.12.7:2112,{fa2209bb-ccca-406a-8d60-566cef166fc2}](L=744196147866,W=744196148509,C=744196148509,E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee}). tl-sxd8e-025.i04.local
Sep 6, 2021 @ 14:09:45.201  Connection 'master-secure' ({eeee6d9d-4215-4047-b1ec-c28a43c78803}) to [10.241.12.7:1114] established.  tl-sxd8e-026.i04.local
Sep 6, 2021 @ 14:09:45.266  ========== [10.241.12.7:2112] PRE-MASTER STATE, WAITING FOR CHASER TO CATCH UP...   tl-sxd8e-025.i04.local
Sep 6, 2021 @ 14:09:45.266  Persistent subscriptions received state change to PreMaster. Stopping listening tl-sxd8e-025.i04.local
Sep 6, 2021 @ 14:09:45.267  Closing connection 'master-secure' [10.241.12.9:1114, L10.241.12.7:47514, {dbf1c28b-6a20-49a5-92ae-b0989dc2a489}] cleanly. Reason: Node state changed to PreMaster. Closing replication connection. tl-sxd8e-025.i04.local
Sep 6, 2021 @ 14:09:45.267  SLOW QUEUE MSG [MainQueue]: ElectionsDone - 109ms. Q: 0/7.  tl-sxd8e-025.i04.local

Sep 6, 2021 @ 14:09:45.375  Connection 'external-secure:dokumentlager-upload-ee44d75b-ca02-4e2f-aedf-fa9214c3fb42' [10.241.12.17:52402, {8644b158-6cea-45cf-88b1-83a9a90b3a76}] closed: Success.    tl-sxd8e-027.i04.local

Sep 6, 2021 @ 14:09:46.704  VND {8478a738-32dc-45e6-b15c-fe913ab5039c} <LIVE> [Master, 10.241.12.9:1112, 10.241.12.9:1114, 10.241.12.9:1113, 10.241.12.9:1115, 10.241.12.9:2112, 10.241.12.9:2113] 744196147866/744196155630/744196155630/E3558@742986393535:{30fa5654-b6b4-45ed-8967-dbd34612faee} | 2021-09-06 12:09:46.271
                        VND {2608d10c-87d5-4a15-8511-21408090e27f} <LIVE> [Slave, 10.241.12.8:1112, 10.241.12.8:1114, 10.241.12.8:1113, 10.241.12.8:1115, 10.241.12.8:2112, 10.241.12.8:2113]  744196150208/744196150385/744196150385/E3559@744196148509:{bb00f180-0228-407d-a42e-e01c3042b1b0} | 2021-09-06 12:09:46.500
                        VND {fa2209bb-ccca-406a-8d60-566cef166fc2} <LIVE> [Master, 10.241.12.7:1112, 10.241.12.7:1114, 10.241.12.7:1113, 10.241.12.7:1115, 10.241.12.7:2112, 10.241.12.7:2113] 744196150208/744196150619/744196150619/E3560@744196150385:{724e2f6c-063a-4bc0-901c-6ae3ec5cc27f} | 2021-09-06 12:09:46.704   tl-sxd8e-025.i04.local

Sep 6, 2021 @ 14:09:46.843  Master [10.241.12.7:1114,{fa2209bb-ccca-406a-8d60-566cef166fc2}] subscribed us at 744196148509 (0xAD458C0D1D), which is less than our writer checkpoint 744196155630 (0xAD458C28EE). TRUNCATION IS NEEDED.  tl-sxd8e-027.i04.local

Sep 6, 2021 @ 14:09:46.914  ========== [10.241.12.9:2112] IS SHUTTING DOWN...   tl-sxd8e-027.i04.local

Sep 6, 2021 @ 14:09:47.655","Got FileBeingDeletedException exception during scavenging, that probably means some chunks were re-replicated.
Scavenging of following chunks will be skipped: #1853-1853 (chunk-001853.000001)
Stopping scavenging and removing temp chunk '/var/lib/eventstore/297da426-7c76-4b2a-8a9c-5719e3f28c90.scavenge.tmp'...
Exception message: Been told the file was deleted > MaxRetries times. Probably a problem in db.."

It’s a rare race condition in V5,
I would advise to upgrade to V20.x or V21.x where this is fixed.

1 Like

@yves.lorphelin We completed our upgrade to 21.10.1 a couple of weeks ago, but seem to have encountered this problem again today. We can provide server logs if so desired, but seems like only images can be uploaded here.