What situations do you think are important for a troubleshooting doc?
define "clustering failures"
You mean like how to troubleshoot your configuration?
How to recover from a corrupt dataset.
Yes. There is, of course, a lot of complexity that is not covered well in the docs around getting it stood up the first time. For example, I finally got a cluster up and it seemed fine and worked for a while, but performance started to be off. After some investigation, I figured out that I was having elections constantly. I looked at it with James and he was like, “Oh, it looks like your timeouts are too short for small AWS instances, raise them.” Which I did, and it was fine. (Incidentally, I have a hard time believing that three nodes in a single AWS region are CONSTANTLY failing to get gossip messages sent across the internal network in 200ms, since none of my testing has shown that network to be anything less than “ok”, but I digress…)
Things like that. How do you know it’s up correctly? How do you know it’s still up? How do you fix things that may come up like a failed node (especially since I believe on AWS right now you have to hard-code a list of ip addresses for nodes for gossip) - things like this post points out: https://groups.google.com/forum/#!topic/event-store/7tOrBsfuYps
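For the “how do you know it’s still up?” question, each node reports its view of the cluster over HTTP. A minimal sketch, assuming the default external HTTP port of 2113 and the standard /gossip endpoint (the IP is a placeholder):

```shell
# Ask a node for its current view of the cluster membership.
# A healthy cluster shows every member with "isAlive": true and exactly one Master.
curl -s http://10.0.0.1:2113/gossip | python -m json.tool
```

Checking this on each node (and confirming they all agree on who the master is) is a quick way to confirm the cluster formed correctly.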
There is nothing that says you have to hardcode node IP addresses. Use DNS. Even with hardcoded addresses, they are only used to seed gossip (i.e. initial seeds; all information from that point forward is decided by gossip).
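A sketch of the two seeding styles being discussed, assuming Event Store 3.x flag names (--gossip-seed, --discover-via-dns, --cluster-dns, --cluster-gossip-port); the addresses and hostname are placeholders:

```shell
# Style 1: hardcoded seed list. These IPs only bootstrap gossip;
# membership from then on is decided by gossip itself.
clusternode --cluster-size=3 --discover-via-dns=false \
  --gossip-seed=10.0.0.1:2112,10.0.0.2:2112

# Style 2: DNS discovery. One DNS name with an A record per node;
# each node resolves it to find its gossip seeds.
clusternode --cluster-size=3 \
  --cluster-dns=cluster.example.com --cluster-gossip-port=2112
```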
I was having elections every 10 seconds or so continuously.
And about the DNS. Right, it SHOULD work (and it will even start up with DNS), but with AWS there seems to be that issue where if you use DNS it stops being able to forward messages. You and I were discussing that and never seemed to get it resolved. That goes away if you don’t use DNS on AWS. I never found out why.
I don’t remember DNS being involved in any issue about forwarding messages; please forward the details to me. I’ve seen issues with forwarding, but none involved DNS.
What was the load on the boxes? The timeout is end to end; sitting at 100% CPU you could easily not reply in 200ms.
Almost no load. ES was the only thing running on the (t2.medium) instances, and less than 1 request a minute was being made on the cluster.
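The fix James suggested above can be sketched as raising the gossip timing settings at startup. Flag names are assumed from Event Store 3.x, and the values here are illustrative, not recommendations:

```shell
# Give slow or virtualized nodes more headroom before a gossip round
# is treated as failed and an election is triggered.
clusternode --gossip-interval-ms=2000 --gossip-timeout-ms=2500
```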
I am assuming your question is literal. That you want the scenarios, not the troubleshooting steps. The ones that spring to mind are:
No node can be elected (due to failure or partitioning).
A node has recovered (or joined) but has not caught up.
A projection is way behind (due to CPU load, etc).
The node is running digest checks across a large DB (long wait for subscribers).
The operator forgot --run-projections=all, or forgot to enable the system projections.
Throughput is very low when writing to a cluster.
A writer/reader isn’t setting “PerformOnMasterOnly,” and it should be doing so.
A subscriber connected to a totally different DB instance by accident.
The local node’s disk or memory is exhausted.
The HTTP prefix and/or ports are misconfigured such that the node is in some way isolated.
The Mono CPU problem on Linux.
The Linux 14.04/CoreOS 706/Mono/whatever Scheduler exception.
ACL madness due to operator error.
More generally, operator error on many of the settings: SSL Certs, gossip timeouts, DNS settings, you name it.
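For the projections item in the list above, a sketch of starting a node with projections enabled and re-enabling a stopped system projection over HTTP. Flag names and the endpoint shape are assumed from Event Store 3.x; the host and the default admin credentials are placeholders:

```shell
# Start with the projection subsystem running all projections,
# including the standard/system ones.
clusternode --run-projections=all --start-standard-projections=true

# Re-enable a system projection that was left stopped
# (%24 is the URL-encoded "$" in $by_category).
curl -s -u admin:changeit -X POST \
  http://10.0.0.1:2113/projection/%24by_category/command/enable
```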
+1 This list is a good starting point.
I’d like to see an error message reference with a bit more explanation of causes.
Yeah, I saw this as being part of it, especially for the common ones.
“Truncation Required” as an example.