Testing and verifying a cluster setup

Does anyone know of any open source projects that can aid in testing and verifying the stability of a cluster?
We’re looking for something that can:

  • look for signs that the cluster has not been configured properly

  • put a cluster under stress/continuous load

  • create chaos by killing nodes in the cluster

  • simulate flaky network

It does not have to tick all the boxes…

Any help would be appreciated!

So we do this kind of testing.

http://blog.garytully.com/2008/10/testing-simulating-network-failure.html
as an example, shows how to simulate a partition.
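
If it helps, the idea can be sketched in a few lines of Python: a throwaway TCP proxy you put between two nodes and then kill (or suspend) to break connectivity. This is only a rough sketch of the technique, not the code from the linked post; the ports and addresses are placeholders.

# Throwaway TCP proxy for partition testing - kill the process to "partition"
# the nodes talking through it. Ports/addresses below are placeholders.
import socket
import sys
import threading

def pipe(src, dst):
    # Copy bytes one way until either side closes.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        src.close()
        dst.close()

def proxy(listen_port, target_host, target_port):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", listen_port))
    server.listen(5)
    while True:
        client, _ = server.accept()
        upstream = socket.create_connection((target_host, target_port))
        threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

if __name__ == "__main__":
    # e.g. python tcp_proxy.py 11120 10.0.0.2 1112 ; kill -9 the process to simulate a partition
    proxy(int(sys.argv[1]), sys.argv[2], int(sys.argv[3]))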

We generally use actual power pulls though (they can cause slightly
different behaviour than just having a partition). As an example:
https://www.geteventstore.com/blog/20130708/testing-event-store-ha/

At Xero we’ve asked the exact same questions.

I’ve pulled together a document which describes our intended approach to testing and monitoring ES.

This doc is going through final review internally before we start implementing everything described in it.

We’ve attempted to write this document in a way which could be useful to people outside of Xero, and it’s based on an informal meeting with PieterG from GetEventStore and one round of internal feedback here.

It’s not quite complete and ready for publishing - but since you’ve asked the question, I’ll post what we have to date. :slight_smile: Hopefully it’s of some use - it doesn’t provide the bits you’re after, but it does describe what is hopefully a fairly exhaustive Ops story. And there’s some useful background info from ES that answers what I imagine are fairly common questions.

Once a final version is done (and hopefully reviewed by GetEventStore) then I’ll post an update to this forum…

cheers,

justin

EventStore testing

This document presents an approach to verifying ES cluster performance, and also which ES internal metrics and logs we should be monitoring to detect cluster and/or node issues.

The current testing approach is based on a single ES cluster in a single region - multi-region ES will be considered later.

This document is written for a specific environment - which happens to be Linux hosts running in AWS, with Sumologic and either Datadog or Prometheus for logs/metrics.

Open questions

  1. What is a reasonable drift for writers to be behind the master?

  2. How do we use the Manager and watchdogs?

  3. Is there a way to monitor HTTP response codes from the cluster? Or is there any way to directly correlate these with the EventStore logs? (e.g. we’re seeing occasional HTTP 500s in our Prod cluster - is there a message we’d see in the logs that would match these events?)

General notes

Useful info that came out of general discussion/questions with ES and AWS.

EventStore

  • slow messages - in general low value. The definition of “slow” is not configurable; the threshold is 150-250 ms, and anything above that is considered slow.

  • EBS - consider which SSD type; it has a definite impact on ES cluster perf and the likelihood of getting “slow” warnings.

  • typical prod issue is deposed/partitioned master.

  • storage has a big impact on perf and is often the cause of prod issues.

  • if the write checkpoint never catches up to the master checkpoint, it could indicate a network or storage issue.

  • 2+ masters indicates a network partition.

  • /gossip contains all nodes known to the cluster, including status, master/slave, and whether nodes are behind or at the current checkpoint.

  • if nodes flip/flop, go to the logs - they are the source of info on why things are flipping.
    e.g. a deposed master separated from the rest of the cluster may not die for a while; it comes up, gets told to die and truncate, then restarts.

  • storagereaderqueues - 4 by default, in charge of servicing read requests from clients.
    if these start backing up and not draining, suspect a slow disk or network; this is also likely logged as warnings.

  • writerqueue - single

  • node startup has multiple stages:

  • first it is just a node

  • then it joins the cluster and starts catching up

  • once its status is master, clone or slave, it is caught up and accepting writes.
    clone = a node not required to meet the configured cluster size (i.e. an extra node)

  • cluster replication is over TCP; everything else, including cluster communications, elections, etc., is over HTTP.

  • ntp - problems if the time drift between nodes exceeds 60 s

  • after discussion, we agreed we’d prefer to put resources into monitoring ES cluster health over canary events

  • when writes/second gets above 4K/second you need to start thinking seriously about storage perf.

  • 15k-20k reads/second should be no problem for a well-configured cluster with appropriate storage (e.g. t2.medium, standard SSD EBS, 3-node cluster)

  • stats are published every 60 sec

  • (large) stats published every 60 s will count towards backup size. A maxage or maxcount can be put on the stats-ipaddress stream to reduce volume (a sketch follows this list).

  • one ES customer builds an entirely new cluster each morning - rebuild the cluster, restore backups from EBS - and accepts a large downtime window each morning to be able to do this.

  • restoring is essentially taking an EBS volume snapshot, attaching the volume, restoring the snapshot, and joining the node to the cluster.

  • to alter user logins:
    PUT /users//command/change-password (basic auth with the default creds, then change them). As per https://github.com/EventStore/EventStore/blob/08c2bdf7dcadd154cffa549d273e3a8e4673c5a1/src/EventStore.ClientAPI/UserManagement/UsersClient.cs#L90
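
On the stats-stream retention point above, a minimal sketch of setting $maxAge via the HTTP API. The stream name, media type and default credentials are assumptions to check against the docs for your ES version.

# Sketch only: cap retention on a node's stats stream by writing $maxAge metadata.
import json
import uuid
import requests

ES_URL = "http://127.0.0.1:2113"        # assumption: default external HTTP port
STREAM = "$stats-127.0.0.1:2113"        # assumption: stats stream name for this node
AUTH = ("admin", "changeit")            # default credentials - change these in prod

metadata_event = [{
    "eventId": str(uuid.uuid4()),
    "eventType": "$metadata",
    "data": {"$maxAge": 24 * 60 * 60},  # keep stats events for 24 hours
}]

resp = requests.post(
    f"{ES_URL}/streams/{STREAM}/metadata",
    data=json.dumps(metadata_event),
    headers={"Content-Type": "application/vnd.eventstore.events+json"},
    auth=AUTH,
)
resp.raise_for_status()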

AWS

  • Use attached storage SSDs rather than EBS

  • SSDs have much higher IOPS which is important in restore scenario

  • Restoring an EBS snapshot will lazy load disk blocks as they are requested

  • When restoring an ES node from a backup, it is necessary to have all the chunks available on disk before starting up the node. Therefore all disk blocks must be restored (whether from EBS snapshot or S3) first.

  • The preference therefore is to explicitly copy all disk blocks from S3 rather than background copy via an EBS snapshot.

  • Use EBS for the root volume.
    So the OS, EventStore binaries, any other management agents, etc, would be baked into that EBS volume.

  • Use Instance Store volumes for the EventStore data itself.
    When you create a new instance you’ll always be populating EventStore data from another source (e.g. another EventStore node, or S3) as opposed to having pre-populated EventStore data in an EBS snapshot.

  • So:
    / - EBS volume. OS, EventStore binaries in /bin, etc
    /data - InstanceStore volume0. EventStore data
    /spare - InstanceStore volume1. Spare, swap, whatever

  • AMI creation would add Instance Stores to the AMI: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/add-instance-store-volumes.html#adding-instance-storage-ami

  • Will probably start with
    i3.large (0.475 TB SSD) + EBS AMI root volume + attached SSD for the EventStore data.
    Some larger instances have multiple instance stores, e.g. the i2.8xlarge comes with 8 × 800 GB instance store volumes.

  • Instance store volumes are attached at instance creation time. I think the 0.475 TB is a single volume, so you’d have only two volumes:
    / - an EBS volume holding the OS
    /data - the 0.475 TB SSD instance store volume
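
As a rough illustration of the layout above, a boto3 sketch for launching a node with the EBS root from the AMI plus an instance store volume for /data. The AMI id, key name and region are placeholders; mounting /data would still be done by cloud-init or similar.

# Sketch: launch an ES node with EBS root (from the AMI) + instance store for data.
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")   # placeholder region

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",        # placeholder: baked AMI with OS + EventStore binaries
    InstanceType="i3.large",
    KeyName="my-key",              # placeholder
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        # The root EBS volume comes from the AMI. Map the instance store volume
        # explicitly; NVMe-based types (i3) expose it automatically, but the
        # mapping is still needed for older generations (e.g. i2).
        {"DeviceName": "/dev/xvdb", "VirtualName": "ephemeral0"},
    ],
)
print(response["Instances"][0]["InstanceId"])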

Infrastructure specifications

Assumptions

Every ES node contains

  • log agent

  • metrics agent

Capture

ES logs

via log agent on every node

logs on linux are in /var/log/eventstore

  1. ensure log agent is configured to pick up the ES error log (/err)

  2. also ensure log agent is configured to pick up the std logs. (we may want to remove this later, depending on log sizes/verbosity)

  3. use log rotation / deletion to reduce the file storage required after logs have been collected by the log agent

ES metrics

Two possible modes of capture:

  1. Prometheus - pulls from an ES HTTP monitoring endpoint

     • put an ELB in front of all nodes for monitoring

     • Prometheus connects to the ELB, which will resolve to a single node

     • if the currently resolved node dies, the ELB will resolve to another live node

  2. Cron job that runs on each ES node and initiates a local job (see the sketch after the endpoint lists below)

     • the job first queries the local ES node to determine if it is the ES master

     • if the node is not the ES master, exit

     • else write the result of the ES monitoring queries to a directory monitored by the log agent

     • write a metric showing the cron job result:

       • metric label = persistence.eventstore.cron.*

       • ES node type

       • elapsed time

       • job result (success/failure)

every 1 second

  • GET /gossip

every 1 minute

  • GET /stats

  • GET /stats/tcp

  • GET /projections/all-non-transient

  • GET /users

  • GET /subscriptions

  • GET /streams/$scavenges?embed=tryharder (Accept=application/json)
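
A minimal sketch of the cron-driven capture mode described above. The output directory, the /info “state” field and the set of endpoints polled are assumptions to adapt; the log agent is expected to tail the output directory.

# Sketch: run from cron on every node; only the master writes monitoring output.
import json
import sys
import time
import requests

ES_URL = "http://127.0.0.1:2113"               # assumption: default external HTTP port
OUTPUT_DIR = "/var/log/eventstore-monitoring"  # assumption: directory tailed by the log agent

def main():
    started = time.time()
    # Ask the local node what it thinks it is; exit unless it is the master.
    info = requests.get(f"{ES_URL}/info", timeout=5).json()
    if info.get("state") != "master":
        return 0
    # Dump the monitoring endpoints to files the log agent will pick up.
    for path, name in (("/gossip", "gossip"), ("/stats", "stats")):
        body = requests.get(f"{ES_URL}{path}", timeout=5).json()
        with open(f"{OUTPUT_DIR}/{name}-{int(started)}.json", "w") as f:
            json.dump(body, f)
    # Record the cron job result itself (persistence.eventstore.cron.*).
    record = {
        "metric": "persistence.eventstore.cron.capture",
        "node_type": info.get("state"),
        "elapsed_seconds": round(time.time() - started, 3),
        "result": "success",
    }
    with open(f"{OUTPUT_DIR}/cron-result-{int(started)}.json", "w") as f:
        json.dump(record, f)
    return 0

if __name__ == "__main__":
    sys.exit(main())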

Automated tests

Automated tests run once as part of ES cluster provisioning. They are intended to prove the cluster is performant.

Tests are to write JSON output and push the results to a directory monitored by the log agent (a wrapper sketch follows the scenario list).

  1. wrfl
    5 clients, 500K requests, 200K streams, 1K payload

  2. rdfl
    5 clients, 10M streams, 50M events, 1K payload

  3. verify
    as implemented by TestClient
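
A sketch of a wrapper that runs the scenarios above and drops JSON results where the log agent will pick them up. The TestClient path, its command-line form and the argument strings are assumptions; map the parameters listed above onto whatever syntax your TestClient version expects.

# Sketch: run each TestClient scenario, time it, and write a JSON result record.
import json
import subprocess
import time

OUTPUT_DIR = "/var/log/eventstore-monitoring"          # assumption: log-agent-watched dir
TEST_CLIENT = "/opt/eventstore/EventStore.TestClient"  # assumption: install path

SCENARIOS = {
    # Scenario name -> command string (arguments here are placeholders; use the
    # parameters from the list above in your TestClient's syntax).
    "wrfl": "wrfl 5 500000",
    "rdfl": "rdfl",
    "verify": "verify",
}

def run(name, command):
    started = time.time()
    proc = subprocess.run([TEST_CLIENT, "--command", command],  # assumption: non-interactive flag
                          capture_output=True, text=True)
    result = {
        "test": name,
        "command": command,
        "exit_code": proc.returncode,
        "elapsed_seconds": round(time.time() - started, 1),
        "stdout_tail": proc.stdout[-2000:],
    }
    with open(f"{OUTPUT_DIR}/test-{name}-{int(started)}.json", "w") as f:
        json.dump(result, f)
    return proc.returncode

if __name__ == "__main__":
    exit(max(run(name, cmd) for name, cmd in SCENARIOS.items()))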

Once the cluster is approved for production use, the test data should be cleaned out:

  1. Stop ES cluster

  2. Delete contents of ES data dir

  3. Start ES cluster

Backup

Each backup should be written to metrics (see the sketch after this list):

  1. Backup started

     • metric label = persistence.eventstore.backup.started

     • node identifier

  2. Backup finished

     • metric label = persistence.eventstore.backup.finished

     • size

     • elapsed time (seconds)

     • node identifier
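
A sketch of emitting these backup metrics over DogStatsD-style UDP (assuming a local Datadog agent on port 8125; swap the transport for whatever your metrics agent expects). do_backup and the node identifier are placeholders.

# Sketch: wrap the backup routine with started/finished/size/elapsed metrics.
import socket
import time

STATSD = ("127.0.0.1", 8125)   # assumption: local DogStatsD agent
NODE_ID = "es-node-1"          # placeholder node identifier

def emit(metric, value=1, mtype="c", tags=None):
    # DogStatsD datagram: metric.name:value|type|#tag:value,...
    tag_str = "|#" + ",".join(tags) if tags else ""
    payload = f"{metric}:{value}|{mtype}{tag_str}"
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(payload.encode(), STATSD)

def run_backup(do_backup):
    emit("persistence.eventstore.backup.started", tags=[f"node:{NODE_ID}"])
    started = time.time()
    size_bytes = do_backup()   # placeholder: your backup routine, returning bytes copied
    emit("persistence.eventstore.backup.finished", tags=[f"node:{NODE_ID}"])
    emit("persistence.eventstore.backup.size", size_bytes, "g", [f"node:{NODE_ID}"])
    emit("persistence.eventstore.backup.elapsedtime", int(time.time() - started), "g", [f"node:{NODE_ID}"])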

Restore

Each restore should be written to metrics:

  1. Restore started

     • metric label = persistence.eventstore.restore.started

     • node identifier

  2. Restore finished

     • metric label = persistence.eventstore.restore.finished

     • size

     • elapsed time (seconds)

     • status (success/failure)

     • node identifier

Publish all restore logs to a directory monitored by the log agent

Chaos Monkey

Each chaos check and hit (where chaos decides to terminate something) should be written to metrics:

  1. Publish chaos checks to metrics

     • metric label = persistence.eventstore.chaos.check

  2. Publish chaos hits to metrics

     • metric label = persistence.eventstore.chaos.hit

     • node identifier for the node being terminated

  3. Publish chaos hits to the log agent, including the node identifier

DR

Backup

  1. Verify there is a single master node (via GET /info) and choose the master as the source of the backup

  2. Run an S3 sync every 30 min to an S3 bucket in the same AWS account (see the sketch after this list)

  3. Ensure the S3 bucket has cross-account replication to a separate AWS backup account

  4. Execute via SSM, or (this may be broken atm) a cron job running on each node as per the ES metrics capture.
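
A sketch of the sync step, assuming the /info “state” field, a placeholder bucket, and the AWS CLI being installed on the node:

# Sketch: 30-minute cron job - only the master syncs its data directory to S3.
import subprocess
import requests

ES_URL = "http://127.0.0.1:2113"                      # assumption: default external HTTP port
DATA_DIR = "/data"                                    # instance store volume as laid out above
BUCKET = "s3://example-eventstore-backups/node-1/"    # placeholder bucket/prefix

info = requests.get(f"{ES_URL}/info", timeout=5).json()
if info.get("state") == "master":
    # aws s3 sync only copies new/changed chunk files, so a 30 min cadence stays cheap.
    subprocess.run(["aws", "s3", "sync", DATA_DIR, BUCKET], check=True)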

Restore

Three restore scenarios:

  1. Every day at 5 PM UTC restore a new node; once complete, terminate the oldest node in the cluster

  2. Whenever chaos terminates a node, restore a new node from S3

  3. Failure scenario - testing what happens when the backup is bad

     • restore a node from a bogus S3 backup

     • ensure the node is rejected by the cluster

     • determine exactly what logs/gossip should be looked for to indicate node restore / cluster join failure

Restore procedure

  1. Stand up new node instance, with ES service not running

  2. Restore node storage from latest S3 backup

  3. Overwrite truncate.chk with chaser.chk

  4. Start the ES service to join the node to the cluster
    The node will be in catch-up mode; catch-up will verify that the node’s data matches

  5. Success is indicated by the node reaching slave status.
    Failure is indicated by the cluster rejecting the node - this will cause the node to suicide and truncate its data, which will be found in the logs

Monitor

The following dashboards and queries are to be built:

Datadog

  1. EBS / Instance Store (SSD) volumes

     • CloudWatch metrics for bandwidth / throughput / queue length / latency / consumed ops / burst balance

  2. Individual nodes

     • cpu load/perf

     • GC

     • storagereaderqueues

     • writerqueue

     • writes/second

     • reads/second

     • node cluster status

  3. Cluster

     • connections

     • node statuses

     • node uptime

     • node catchup point

     • ntp differential

     • backup size, result and elapsed time

     • restore result and elapsed time

  4. Clusters
    (all environments)

     • node statuses

  5. Backup

     • SSM Association running backups to S3. Could monitor the last write time to S3 and alert if it’s > the backup interval?
       If not using SSM for backups, then monitor the results of the backup cron job

Sumologic

  1. ES error logs

  2. ES gossip

  3. ES stats

  4. Automated test results

  5. ES users/connections

  6. Backup failure logs

Alerts

Alerts for the following conditions are to be set up:

  1. not exactly 1 master in cluster

  2. not at least 1 slave
    (for 5-node cluster actually want at least 3 slaves)

  3. actual cluster size < configured cluster size for longer than 1 hour

  4. configured cluster size != actual cluster size / asg size

  5. cluster health GET /gossip fails/stops producing monitoring data

  6. node status flip/flopping for any node

  7. storagereaderqueues backing up and not draining

  8. any node status is not one of master, clone, slave and node is > 30 min old

  9. ntp diff between any node > 30 sec

  10. writes/second > 4K

  11. reads/second > 20K

  12. restore.elapsedtime > 30 min

  13. backup.elapsedtime > 1 hour

  14. avail disk space < 30% total disk space

  15. no nodes are < 2 days old (would indicate nodes not being refreshed from daily backup)

Chaos Monkey

The purpose of the chaos monkey is to randomly kill a node and verify that automated recovery/restore/rejoining the cluster/catch-up/monitoring is working

  1. Update the chaos monkey to find and kill an ES node no more than once per day. It can be any node, i.e. do not exclude the master.

  2. Do not execute chaos if actual cluster size < configured cluster size

  3. Do not execute chaos if any nodes are not in a healthy state

  4. Provide a semi-automated way to manually disable chaos invocations
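
A sketch of the pre-checks, reading /gossip from any live node (or an ELB). The configured cluster size, gossip field names and state values are assumptions to verify for your ES version.

# Sketch: refuse to terminate anything unless the cluster is full-size and healthy.
import random
import requests

GOSSIP_URL = "http://es-cluster.internal:2113/gossip"   # placeholder: any node or ELB
CLUSTER_SIZE = 3                                        # assumption: configured cluster size
HEALTHY_STATES = {"Master", "Slave", "Clone"}           # assumption: state spellings

def pick_victim():
    members = requests.get(GOSSIP_URL, timeout=5).json().get("members", [])
    alive = [m for m in members if m.get("isAlive")]
    # Do not execute chaos if the cluster is under-sized...
    if len(alive) < CLUSTER_SIZE:
        return None
    # ...or if any node is not in a healthy state.
    if any(m.get("state") not in HEALTHY_STATES for m in alive):
        return None
    # Any node is fair game, including the master.
    return random.choice(alive).get("externalHttpIp")

if __name__ == "__main__":
    victim = pick_victim()
    print(victim or "chaos skipped")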