Events lost during leader election

Anders_Flemmen · March 7, 2022, 3:11pm

Hello, we are having an unidentified issue that causes the leader of our three-node cluster to freeze for a few seconds, triggering a leader election on the two other nodes. Nearly every time this happens, we lose a couple of events when the previous leader starts responding again. It seems to accept some events even though it is no longer the leader, and tells the client that they were persisted. Shortly afterwards, it realizes this and goes offline for truncation. Shouldn’t this be impossible by design?
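To illustrate what "impossible by design" is expected to mean here, this is a minimal sketch of a majority-acknowledgement rule. It is a simplified model for discussion only, not EventStoreDB's actual implementation, and the function names are hypothetical.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def may_acknowledge(replica_acks: int, cluster_size: int) -> bool:
    """Return True if a write may be reported to the client as persisted.

    replica_acks counts every node that has stored the write,
    including the leader itself (an assumption of this model).
    """
    return replica_acks >= majority(cluster_size)


# A deposed leader in a three-node cluster only has its own copy,
# so under this rule it should not be able to confirm the write.
print(may_acknowledge(replica_acks=1, cluster_size=3))
# A leader backed by one follower has a majority of two out of three.
print(may_acknowledge(replica_acks=2, cluster_size=3))
```

Under this model, a frozen former leader that cannot reach the other two nodes should never be able to confirm a write, which is why the behaviour we observe is surprising.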

We have had this problem for quite some time, starting with version 5, as discussed in https://discuss.eventstore.com/t/lost-events-during-leader-election-truncation. We recently upgraded our clusters to version 21.10.1, but the problem still seems to be present. We have used the Akka JVM client, the gRPC Java client, and ESJC, and the problem has occurred with all of them.

The logs from all three nodes at the time of the incident can be found below. No events were accepted between 10:27:52 and 10:28:03. The events that are missing were written between 10:28:08.871 and 10:28:09.244. There were also other events written in this timespan that did not get lost.
https://gist.github.com/andersflemmen/9cf8630417b4e99f8ea1c2631306e88f <- NEW LEADER
https://gist.github.com/andersflemmen/17858e33dba5a18b305f35c45d2d1c2a <- OLD LEADER
https://gist.github.com/andersflemmen/c6d00cd67ca9f24a6c43be46762de941
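
For anyone wanting to do the same check on their own data, here is a rough sketch of how acknowledged writes can be filtered down to the suspect window quoted above. It assumes a client-side record of acknowledged writes as (event id, time of day) pairs, which is a hypothetical format and not something any of the clients produces out of the box.

```python
from datetime import time

# Window in which the missing events were written (taken from the post above).
SUSPECT_START = time(10, 28, 8, 871000)
SUSPECT_END = time(10, 28, 9, 244000)


def in_suspect_window(written_at: time) -> bool:
    """True if the write time falls inside the suspect window (inclusive)."""
    return SUSPECT_START <= written_at <= SUSPECT_END


def suspect_writes(writes):
    """Return the ids of acknowledged writes made inside the suspect window.

    writes is an iterable of (event_id, written_at) pairs, where
    written_at is a datetime.time (hypothetical client-side record).
    """
    return [event_id for event_id, written_at in writes
            if in_suspect_window(written_at)]


# Illustrative data only; these ids are made up.
acknowledged = [
    ("event-a", time(10, 27, 50)),
    ("event-b", time(10, 28, 8, 900000)),
    ("event-c", time(10, 28, 10)),
]
print(suspect_writes(acknowledged))
```

The ids returned are the ones worth reading back from the cluster to see whether they survived the truncation.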

timothy.coleman · March 8, 2022, 10:26am

Hi Anders,

Yes, this should be prevented by design.

Were the events lost from the beginning of the following streams?

x-76296fea-2d6b-4d4b-8737-149b18277e73
x-ec16afd8-3814-414d-9d4a-580cb95dbba6

Anders_Flemmen · March 8, 2022, 10:55am

No, the events from those streams are still there.

The streams that lost events are not mentioned in the logs. Also, this has been happening both with new streams and with streams that already have events.

timothy.coleman · April 19, 2022, 1:14pm

Hi Anders,

To our considerable surprise we were able to find a case where the behaviour you described could occur. I’ve filed a ticket for it here https://github.com/EventStore/EventStore/issues/3472

Anders_Flemmen · April 20, 2022, 5:17am

Wow, relieved to hear that! Looking forward to a fix, and hope that it solves the problems we have been seeing. Thanks!