Event store crashing

Hi All,

I have set up a three-node cluster. I see in the logs that some of the nodes are crashing; it says 'truncation needed' because the writer checkpoint is incorrect. I don't know why. Does anyone have an idea how to prevent it? Thanks.

And what happens when you restart it?

I didn't restart it; it recovered by itself.

Perhaps you can explain more? The process exited, no? (It says it did in the log.) Did you or some automation restart the process afterwards? What happened when it restarted?

I am guessing from your messages that this is what happened:

[04153,01,15:02:27.659] MessageHierarchy initialization took 00:00:00.0859963.
[04153,01,15:02:27.697] Truncate checkpoint is present. Truncate: 6780144641 (0x19420CC01), Writer: 6780144809 (0x19420CCA9), Chaser: 6780144809 (0x19420CCA9), Epoch: 6302324412 (0x177A5D6BC)
[04153,01,15:02:27.916] Truncating chaser from 6780144809 (0x19420CCA9) to 6780144641 (0x19420CC01).
[04153,01,15:02:27.916] Truncating writer from 6780144809 (0x19420CCA9) to 6780144641 (0x19420CC01).

and then everything worked fine?


Yes, everything is fine afterwards, but this happens a lot; I see it in the logs of the other nodes as well.

The only time that should happen is when you have a situation known as a
deposed master. This is where the master ends up in a minority partition
of the cluster, due to a network partition etc. It is normal operation
and a handled situation (doing the truncation online is extremely
tricky; it's easier to do offline, which is what we do, handling the
case on restart).

You say it happens "a lot". Can you define "a lot"? What kind of
networking do you have between the nodes? These kinds of network
partitions should be rare.

We are using blade servers in a Dell M1000 chassis, running CentOS 7. We are using a bond0.xxx network interface; there is no separation between internal and external traffic.

The startup command is:

sudo -E -u eventstore /opt/eventstore/clusternode --http-prefixes=http://+:2113/,http://somedns.name:2112/ --config=/etc/eventstore/eventstore.yml

Clients connect through a load balancer; the nodes are connected in a bus topology.

On two of the nodes it happened twice a week (on different dates), but on node1 it happened only once.

So this happens at times when the nodes can't talk to each other.
Are there other blades in the chassis? Do these things happen to occur
at certain times? If you are on a shared network, you may also want to
tweak things like the heartbeat timeouts between nodes, though I
believe they are about 500 ms by default, which should be plenty.

In general you want to supervise a node and automatically restart it.
This is pretty much a solved problem on Linux.

The production nodes are connected to two Dell M6348 blade switches; blades 10, 11, and 12 are used. Two VLANs are allowed on the bonded interface, but at the moment only one VLAN is configured on the server side. The nodes fail in a random order; there is no specific pattern observed.

So if you configure supervision, your problem will go away (as you see,
things are resolved on a restart). It is a perfectly normal edge case.
There are about 5,000 ways of doing this on Linux; this is why, for
instance, the commercial version runs with a supervisor. There are some
cases where the best thing to do is bomb out and fix on restart.
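Since this is CentOS 7, systemd is already available as the supervisor. A minimal sketch of a unit file follows; the paths, user, and flags are taken from the startup command above, and everything else (unit name, restart delay) is an assumption, not a recommendation:

```ini
# /etc/systemd/system/eventstore.service -- illustrative sketch only.
# ExecStart mirrors the startup command quoted earlier in this thread.
[Unit]
Description=Event Store cluster node
After=network-online.target
Wants=network-online.target

[Service]
User=eventstore
ExecStart=/opt/eventstore/clusternode --http-prefixes=http://+:2113/,http://somedns.name:2112/ --config=/etc/eventstore/eventstore.yml
# Restart whenever the process exits, e.g. after a deposed-master shutdown
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Then `systemctl enable --now eventstore` would start it and bring it back automatically after the "truncation needed" exit.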

As for why your nodes are losing communication with each other, that
can be a long and drawn-out process to determine, as networks are
involved (especially if they are shared; a server doing a backup can
cause this, for example).

Also, as in the other thread, are you setting timeouts for the
heartbeats on the internal network? If so, to what?

You mean IntTcpHeartbeatTimeout? It is set to the default value (700 ms), which is big enough.
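For reference, those heartbeat settings live in eventstore.yml. A sketch follows: the 700 ms timeout is the value mentioned in this thread; the interval and the external-interface values are assumptions, shown only to illustrate where the knobs are, not as tuning advice:

```yaml
# eventstore.yml -- heartbeat settings (illustrative values, in ms)
IntTcpHeartbeatInterval: 700    # how often internal heartbeats are sent (assumed)
IntTcpHeartbeatTimeout: 700     # default per this thread; raise on a flaky network
ExtTcpHeartbeatInterval: 2000   # external interface (assumed)
ExtTcpHeartbeatTimeout: 1000    # external interface (assumed)
```

On a shared or congested network, raising the timeouts reduces false "node dead" verdicts at the cost of slower failure detection.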