See second comment. There are log messages about a simple message processing taking 4 seconds! This would trip timeouts between nodes. As a side note I’d be curious how many of these issues you have running on a local node. With virtualization there are countless things that could have happened (eg shared node and something high court jumped off etc)
It’s frequent because you’re seeing frequent elections, likely due to all the timeouts.
What are you config settings?
@Greg - It’s always possible that virtualization is a factor. Alas, none of the AWS metrics show anything unusual. Can you give us an idea of the (kernel) resources accessed during simple message processing, so that we can dig deeper?
When we ran clusters on non-virtual hardware, we also saw very frequent elections. In our experience, network congestion really does a number on the gossip traffic. However, on these systems, the elections never caused suicides. It’s not quite apples to apples, since the non-virtual nodes were running Windows. We’ll start building a non-virtual Linux cluster.
@James - I’ll work on getting those settings.
When we ran clusters on non-virtual hardware, we also saw very
frequent elections. In our experience, network congestion really does
a number on the gossip traffic. However, on these systems, the
elections never caused suicides. It's not quite apples to apples,
since the non-virtual nodes were running Windows. We'll start
building a non-virtual Linux cluster."
Its dumb luck.
If you lose a master it will every time deterministically the "crash" you see.
We are looking at another possible root cause dealing with gc.
Can you print out what is printed with --what-if when you run it?
What-if gives me the following:
Exit reason: WhatIf option specified
ES VERSION: 220.127.116.11 (master/a1382252dd1ed0554ddb04015cdb2cbc1b0a65c1, Fri, 13 Mar 2015 13:56:05 -0400)
OS: Unknown (Unix 18.104.22.168)
RUNTIME: 3.12.0 (tarball) (64-bit)
GC: 2 GENERATIONS
HELP: False ()
VERSION: False ()
LOG: /event-store-data/log (Command Line)
WHAT IF: True (Command Line)
INT IP: 127.0.0.1 ()
EXT IP: 10.243.0.211 (Command Line)
INT HTTP PORT: 2112 ()
EXT HTTP PORT: 2113 ()
INT TCP PORT: 1112 ()
INT SECURE TCP PORT: 0 ()
EXT TCP PORT: 1113 (Command Line)
EXT SECURE TCP PORT: 0 ()
INT TCP HEARTBEAT TIMEOUT: 700 ()
EXT TCP HEARTBEAT TIMEOUT: 1000 ()
INT TCP HEARTBEAT INTERVAL: 700 ()
EXT TCP HEARTBEAT INTERVAL: 2000 ()
FORCE: False ()
CLUSTER SIZE: 3 (Command Line)
NODE PRIORITY: 0 ()
MIN FLUSH DELAY MS: 2 ()
COMMIT COUNT: -1 ()
PREPARE COUNT: -1 ()
ADMIN ON EXT: True ()
STATS ON EXT: True ()
GOSSIP ON EXT: True ()
DISABLE SCAVENGE MERGING: False ()
DISCOVER VIA DNS: False (Command Line)
CLUSTER DNS: fake.dns ()
CLUSTER GOSSIP PORT: 30777 ()
GOSSIP SEED: [ 10.243.4.120:2112, 10.243.8.160:2112 ] (Command Line)
STATS PERIOD SEC: 30 ()
CACHED CHUNKS: -1 ()
CHUNKS CACHE SIZE: 536871424 ()
MAX MEM TABLE SIZE: 1000000 ()
DB: /event-store-data/db (Command Line)
MEM DB: False ()
SKIP DB VERIFY: False ()
RUN PROJECTIONS: All (Command Line)
PROJECTION THREADS: 3 ()
WORKER THREADS: 5 ()
HTTP PREFIXES: [ http://*:2113/ ] (Command Line)
ENABLE TRUSTED AUTH: False ()
CERTIFICATE STORE LOCATION: ()
CERTIFICATE STORE NAME: ()
CERTIFICATE SUBJECT NAME: ()
CERTIFICATE THUMBPRINT: ()
CERTIFICATE FILE: ()
CERTIFICATE PASSWORD: ()
USE INTERNAL SSL: False ()
SSL TARGET HOST: n/a ()
SSL VALIDATE SERVER: True ()
AUTHENTICATION TYPE: internal ()
PREPARE TIMEOUT MS: 2000 ()
COMMIT TIMEOUT MS: 2000 ()
UNSAFE DISABLE FLUSH TO DISK: False ()
GOSSIP INTERVAL MS: 1000 ()
GOSSIP ALLOWED DIFFERENCE MS: 60000 ()
GOSSIP TIMEOUT MS: 500 ()
[00001,01,21:48:32.176] Exiting with exit code: 0.
Exit reason: WhatIf option specified
also : When we ran clusters on non-virtual hardware, we also saw very
frequent elections. I would have to see logs to have any comment.
" Can you give us an idea of the (kernel) resources accessed during
simple message processing, so that we can dig deeper?"
Depends on the message. Some of the ones above would appear to just be
CPU so I would watch gc pauses etc as a start.
Btw you have internal timeouts set for 500/700ms. This really isn’t much room especially under load in the cloud and likely you will get timeouts. These are listed below as timeouts (internal and gossip).
Apologies, looks like the changes were regressed after Tao’s original run. We’re correcting that now. We have never changed the internal timeout before. We’ll bump both timeouts to 2000ms, and re-run.
As far as your GC suspicion, if there any data we can gather, let us know. Also, we can arrange access to our test cluster in AWS, if that would help.
Update: we have gotten better runs, for sure. Many hours under constant load. However, when a network partition does exceed the timeouts, we do still get the suicides. I think we understand about 90% of this now, but there are some edge cases we have questions about. I want to gather our thoughts before we ask. In the meantime, I think James said that the behavior is deterministic given a precise outage pattern? Do you guys have docs on those behaviors?
The situations is this (I will explain with network cables if thats
ok). The situation is a well known (academically) edge condition known
as "deposed master"
So bring up a cluster with three nodes A,B,C
Cluster decides: A master, B slave, C slave.
Start a client writing say 500 messages/second to cluster.
At this point pull the network cable of A to introduce a partition
(master in minority). The other two nodes will form their own cluster
let's imagine C becomes master (non-deterministic which one is new
master). Wait about 1 minute then plug network cable back in on A.
When A comes up A has some writes that were never acknowledged by B or
C (e.g. his log doesn't match). At this point he will say "I need to
go down for truncation". We could do this online but its very very
complicated while hari kari is quite simple here
Yep thats the 90% part we get.
One edge case we saw: BOTH slaves suicided when the old master became visible again. This downs the cluster, and then the master falls back to an “unknown” state.
The outage was 2.8 seconds, and the timeouts were 2 seconds. We expect that the old master would “self-demote” because it was past the gossip window. We thought it would be the one that suicided, but the other boxes did instead. Any idea why?
(We can also send you logs from all three boxes, if that would help)
I'm not entirely clear how the slaves suicided (to be fair we run
24/365 randomized testing on this) .. Logs would be good on this.
Not a problem. I can send them today. Where shall I send them?
Just send to my address.