Hello, we are having an unknown issue that causes the leader of our three-node cluster to freeze for a few seconds, triggering a leader election between the other two nodes. Nearly every time this happens, we lose a couple of events when the previous leader starts responding again. It seems to accept some events even though it is no longer the leader, and it tells the client that they were persisted. Shortly afterwards, it realizes that it has lost leadership and goes offline for truncation. Shouldn’t this be impossible by design?
We have had this problem for quite some time, starting with version 5, as discussed in https://discuss.eventstore.com/t/lost-events-during-leader-election-truncation. We recently upgraded our clusters to version 21.10.1, but the problem still seems to be present. We have used the Akka JVM client, the gRPC Java client, and ESJC, and the problem has occurred with all of them.
The logs from all three nodes at the time of the incident can be found below. No events were accepted between 10:27:52 and 10:28:03. The missing events were written between 10:28:08.871 and 10:28:09.244. Other events written in the same time span were not lost.
https://gist.github.com/andersflemmen/9cf8630417b4e99f8ea1c2631306e88f <- NEW LEADER
https://gist.github.com/andersflemmen/17858e33dba5a18b305f35c45d2d1c2a <- OLD LEADER
https://gist.github.com/andersflemmen/c6d00cd67ca9f24a6c43be46762de941
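To make the symptom concrete, here is a minimal sketch of how a loss like this could be detected on the client side. It is not our production code and does not use any EventStoreDB client API: it assumes you have already collected the IDs and write times of the events the cluster acknowledged, and the IDs that can actually be read back from the stream afterwards. The function names, event IDs, and sample timestamps are made up for illustration; only the window boundaries come from the incident above.

```python
from datetime import time

# Window in which the missing events were written (from the incident above).
WINDOW_START = time.fromisoformat("10:28:08.871")
WINDOW_END = time.fromisoformat("10:28:09.244")


def find_lost_events(acknowledged, persisted_ids):
    """Return acknowledged (event_id, write_time) pairs whose ID cannot be read back."""
    persisted = set(persisted_ids)
    return [(event_id, ts) for event_id, ts in acknowledged if event_id not in persisted]


def in_window(ts, start=WINDOW_START, end=WINDOW_END):
    """True if the write time falls inside the suspect window (inclusive)."""
    return start <= ts <= end


# Hypothetical sample data: three writes the client was told were persisted.
acknowledged = [
    ("evt-1", time.fromisoformat("10:28:08.900")),
    ("evt-2", time.fromisoformat("10:28:09.100")),
    ("evt-3", time.fromisoformat("10:28:09.200")),
]
# Hypothetical result of reading the stream back after the old leader truncated.
persisted_ids = ["evt-2"]

lost = find_lost_events(acknowledged, persisted_ids)
for event_id, ts in lost:
    print(event_id, ts.isoformat(timespec="milliseconds"), "in window:", in_window(ts))
```

A check like this only shows that acknowledged writes are missing. It does not say which node acknowledged them, which is why the node logs above are needed.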