At Xero we’ve asked the exact same questions.
I’ve pulled together a document which describes our intended approach to testing and monitoring ES.
This doc is going through final review internally before we start implementing everything described in it.
We’ve attempted to write this document in a way which could be useful to people outside of Xero, and it’s based on an informal meeting with PieterG from GetEventStore and one round of internal feedback here.
It’s not quite complete and ready for publishing, but since you’ve asked the question, I’ll post what we have to date. Hopefully it’s of some use: it doesn’t provide the bits you’re after, but it does describe what is hopefully a fairly exhaustive Ops story. And there’s some useful background info from ES that provides answers to what I imagine are fairly common questions.
Once a final version is done (and hopefully reviewed by GetEventStore) then I’ll post an update to this forum…
cheers,
justin
EventStore testing
This document presents an approach to verifying ES cluster performance, and also which ES internal metrics and logs we should be monitoring to detect cluster and/or node issues.
The current testing approach is based on a single ES cluster in a single region - multi-region ES will be considered later.
This document is written for a specific environment, which happens to be Linux hosts running in AWS, with Sumologic and either Datadog or Prometheus for logs/metrics.
Open questions
-
What is a reasonable drift for writers to be behind the master?
-
How do we use the Manager and watchdogs?
-
Is there a way to monitor HTTP response codes from the cluster? Or is there any way to directly correlate these with the EventStore logs? (e.g. we’re seeing occasional HTTP 500s in our Prod cluster - is there a message we’d see in the logs that would match these events?)
General notes
Useful info that came out of general discussion/questions with ES and AWS.
EventStore
-
slow messages - in general low value. defn of “slow” not configurable. value = 150-250ms, anything above that is slow.
-
ebs - consider which ssd type. has a definite impact on ES cluster perf and likelihood of getting “slow” warnings.
-
typical prod issue is deposed/partitioned master.
-
storage has a big impact on perf and is often the cause of prod issues.
-
if writecheckpoint never catches up to master checkpoint - could indicate a network-related or storage problem.
-
2+ masters indicates a network partition.
-
/gossip contains all nodes known to cluster, incl status, master/slave, whether nodes are behind/current checkpoint.
-
if nodes flip/flop, go to logs - source of info why things flipping.
eg if deposed master separated from rest of cluster but doesn’t die for a while. comes up, gets told to die and truncate, restart.
-
storagereaderqueues - 4 by default. In charge of servicing read requests from clients.
if these start backing up and not draining, maybe slow disk, network. also likely logged out as warnings.
-
writerqueue - single
-
node startup has multiple stages.
-
firstly is a node
-
joined cluster - catching up
-
once status is master, clone, slave then caught up and accepting writes.
clone = a node not req to meet configured cluster config size (ie extra node)
-
cluster replication is over tcp, everything else incl cluster communications and election, etc all over http.
-
ntp - problems if exceeding 60s time drift btwn nodes
-
after discussion, agreed prefer to put resources into monitoring ES cluster health over canary events
-
when writes/second gets > 4K/second need to start seriously thinking about storage perf.
-
15k-20k reads/second should be no problem for a well-configured cluster and appropriate storage (for eg t2-med, std ssd ebs, 3 node cluster)
-
stats are published every 60 sec
-
(large) stats published every 60s will count towards backup size. can put maxage or maxcount on stats-ipaddress stream to reduce volume.
-
one ES customer builds an entire new cluster each morning - rebuild cluster, restore backups from ebs - have large downtime each morning to be able to do this.
-
restoring is essentially taking a EBS volume snapshot, attaching volume, restoring snapshot, joining node to cluster.
-
to alter user logins;
PUT /users//command/change-password (basic auth with default creds, then change them). As per https://github.com/EventStore/EventStore/blob/08c2bdf7dcadd154cffa549d273e3a8e4673c5a1/src/EventStore.ClientAPI/UserManagement/UsersClient.cs#L90
AWS
-
Use attached storage SSDs rather than EBS
-
SSDs have much higher IOPS which is important in restore scenario
-
Restoring an EBS snapshot will lazy load disk blocks as they are requested
-
When restoring an ES node from a backup, it is necessary to have all the chunks available on disk before starting up the node. Therefore all disk blocks must be restored (whether from EBS snapshot or S3) first.
-
Preference therefore is for explicitly copying all disk blocks from S3 rather than background copy via EBS snapshot.
-
Use EBS for the root volume.
So the OS, EventStore binaries, any other management agents, etc, would be baked into that EBS volume.
-
Use Instance Store volumes for the EventStore data itself.
When you create a new instance you’ll always be populating EventStore data from another source (e.g. another EventStore node, or S3) as opposed to having pre-populated EventStore data in an EBS snapshot.
-
So:
/ - EBS volume. OS, EventStore binaries in /bin, etc
/data - InstanceStore volume0. EventStore data
/spare - InstanceStore volume1. Spare, swap, whatever
-
AMI creation would add Instance Stores to the AMI: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/add-instance-store-volumes.html#adding-instance-storage-ami
-
Will probably start with
i3.large (0.475 TB SSD) + EBS ami root volume + attached SSD for the eventstore data
Some larger instances have multiple instance stores, so the i2.8xlarge comes with 8*800GB instance store volumes.
-
Instance store volumes are attached at instance creation time. I think the 0.475 is only one volume, so you’d have only two volumes:
/ - An EBS volume holding OS
/Data - the 0.475TB ssd instance store volume
Infrastructure specifications
Assumptions
Every ES node contains
Capture
ES logs
via log agent on every node
logs on linux are in /var/log/eventstore
-
ensure log agent is configured to pick up the ES error log (/err)
-
also ensure log agent is configured to pick up the std logs. (we may want to remove this later, depending on log sizes/verbosity)
-
use log rotation / deletion to reduce file storage required after logs have been collected by the log agent
ES metrics
Two possible modes of capture;
-
Prometheus - pulls from an ES http monitoring endpoint
-
put an ELB in front of all nodes for monitoring
-
prometheus connects to the ELB, will resolve to a single node
-
If currently resolved node dies, ELB will resolve to another live node
-
Cron job that runs on each ES node and initiates local job
-
Job first queries local ES to determine if node is ES master
-
If node is not ES master then exit
-
else write result of ES monitoring query to directory monitored by log agent
-
write metric showing cron job result;
-
metric label = persistence.eventstore.cron.*
-
ES node type
-
elapsed time
-
job result (success/failure)
every 1 second
every 1 minute
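A minimal sketch of the cron-job mode described above, to make the master check and the job metric concrete. It assumes the node's HTTP API is on 127.0.0.1:2113, that GET /info returns a JSON object with a "state" field, and that GET /stats is the monitoring query. The function names, the file naming and the ".run" metric suffix are hypothetical, not part of the design above.

```python
import json
import time
import urllib.request
from pathlib import Path

BASE_URL = "http://127.0.0.1:2113"  # assumption: local node, default HTTP port


def fetch_json(path, base_url=BASE_URL, timeout=5):
    """GET a JSON document from the local ES node."""
    req = urllib.request.Request(base_url + path,
                                 headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def is_master(info):
    """True if the /info payload reports master (assumed 'state' field)."""
    return str(info.get("state", "")).lower() == "master"


def run_job(out_dir, fetch=fetch_json, clock=time.time):
    """Run one cron iteration and return the job metric as a dict."""
    started = clock()
    # Hypothetical suffix under the persistence.eventstore.cron.* label.
    metric = {"metric": "persistence.eventstore.cron.run",
              "node_type": "unknown"}
    try:
        info = fetch("/info")
        metric["node_type"] = str(info.get("state", "unknown")).lower()
        if is_master(info):
            # Only the master writes stats into the directory the
            # log agent is watching; other nodes exit quietly.
            stats = fetch("/stats")
            out = Path(out_dir) / f"stats-{int(started)}.json"
            out.write_text(json.dumps(stats))
        metric["result"] = "success"
    except Exception as exc:  # any failure becomes a failed job metric
        metric["result"] = "failure"
        metric["error"] = str(exc)
    metric["elapsed_seconds"] = round(clock() - started, 3)
    return metric
```

The fetcher and clock are injected so the job can be exercised without a running node. In the real cron job the returned dict would be handed to whatever metrics client is in use.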
Automated tests
Automated tests run once as part of ES cluster provisioning. They are intended to prove the cluster is performant.
Tests are to write json output and push results to a directory monitored by log agent.
-
wrfl
5 clients, 500K requests, 200K streams, 1K payload
-
rdfl
5 clients, 10M streams, 50M events, 1K payload
-
verify
as implemented by TestClient
Once the cluster is approved for production use then test data should be cleaned out;
-
Stop ES cluster
-
Delete contents of ES data dir
-
Start ES cluster
Backup
Each backup should be written to metrics
-
Backup started
-
metric label = persistence.eventstore.backup.started
-
node identifier
-
Backup finished
-
metric label = persistence.eventstore.backup.finished
-
size
-
elapsedtime (seconds)
-
node identifier
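One possible shape for the backup metrics listed above, as a rough sketch. The metric labels and fields come from the list; the record layout, the helper names and the `do_backup` callable (assumed to return the backup size in bytes) are hypothetical.

```python
import json
import time


def backup_metric(event, node_id, **fields):
    """Build one metric record; event is 'started' or 'finished'."""
    if event not in ("started", "finished"):
        raise ValueError(f"unknown backup event: {event}")
    record = {"metric": f"persistence.eventstore.backup.{event}",
              "node": node_id}
    record.update(fields)
    return record


def run_backup(node_id, do_backup, emit=print, clock=time.monotonic):
    """Bracket a backup with the started and finished metrics.

    do_backup is a hypothetical callable that performs the backup and
    returns the number of bytes written. If it raises, no finished
    metric is emitted, so a started record without a matching finished
    record marks a failed backup.
    """
    emit(json.dumps(backup_metric("started", node_id)))
    t0 = clock()
    size = do_backup()
    emit(json.dumps(backup_metric("finished", node_id, size=size,
                                  elapsedtime=round(clock() - t0, 3))))
    return size
```

The same wrapper would fit the restore metrics with an extra status field.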
Restore
Each restore should be written to metrics
-
Restore started
-
metric label = persistence.eventstore.restore.started
-
node identifier
-
Restore finished
-
metric label = persistence.eventstore.restore.finished
-
size
-
elapsedtime (seconds)
-
status (success/failure)
-
node identifier
Publish all restore logs to directory monitored by log agent
Chaos Monkey
Each chaos check and hit (where chaos decides to terminate something) should be written to metrics
-
Publish chaos checks to metrics
-
metric label = persistence.eventstore.chaos.check
-
Publish chaos hits to metrics
-
metric label = persistence.eventstore.chaos.hit
-
node identifier for node being terminated
-
Publish chaos hits to log agent, including node identifier
DR
Backup
-
Verify single master node (via GET /info), choose master as source of backup
-
Run S3 Sync every 30min to S3 bucket in same AWS account
-
Ensure S3 bucket has cross-account replication to separate AWS backup account
-
Execute via SSM, or (this may be broken atm) cron job running on each node as per ES metrics capture.
Restore
Three restore scenarios;
-
Every day restore a new node at 5PM UTC
-
once complete then terminate oldest node in cluster
-
Whenever chaos terminates a node then restore a new node from S3
-
Failure scenario - testing what happens when backup is bad
-
restore node from bogus S3 backup
-
ensure node is rejected by cluster
-
determine exactly what logs/gossip should be looked for to indicate node restore/cluster join failure
Restore procedure
-
Stand up new node instance, with ES service not running
-
Restore node storage from latest S3 backup
-
Overwrite truncate.chk with chaser.chk
-
Start ES service to join node to cluster
Node will be in catchup mode, catchup will verify node data matches
-
Success indicated by node reaching slave status
Failure indicated by cluster rejecting the node - will cause node to suicide and truncate data - this will be found in the logs
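The checkpoint step of the restore procedure could look roughly like this. It is a sketch that assumes the restored data directory holds chaser.chk and truncate.chk side by side; the function name is hypothetical, and the S3 download and service start are left out.

```python
import shutil
from pathlib import Path


def prepare_restored_data(data_dir):
    """Copy chaser.chk over truncate.chk in a restored data directory.

    Must run after the S3 restore has finished and before the ES
    service is started on the new node.
    """
    data = Path(data_dir)
    chaser = data / "chaser.chk"
    if not chaser.is_file():
        # Refuse to continue: without chaser.chk the restore is incomplete.
        raise FileNotFoundError(f"{chaser} missing; restore looks incomplete")
    target = data / "truncate.chk"
    shutil.copyfile(chaser, target)
    return target
```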
Monitor
Following dashboards and queries to be built
Datadog
-
EBS / Instance Store (SSD) volumes
-
CloudWatch metrics for bandwidth / throughput / queue length / latency / consumed ops / burst balance
-
Individual nodes
-
cpu load/perf
-
GC
-
storagereaderqueues
-
writerqueue
-
writes/second
-
reads/second
-
node cluster status
-
Cluster
-
connections
-
node statuses
-
node uptime
-
node catchup point
-
ntp differential
-
backup size and result and elapsed time
-
restore result and elapsed time
-
Clusters
(all environments)
-
node statuses
-
Backup
-
SSM Association running backups to S3. Could monitor last write time to S3 and alert if it’s > backup interval?
If not using SSM for backups then monitoring results of backup cron job
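The "last write time to S3" idea for backup monitoring could be checked along these lines. This is only a sketch: how the last write time is obtained (for example from the newest object in the bucket) is left open, and the 5 minute slack is an assumption rather than a figure from this document.

```python
from datetime import datetime, timedelta


def backup_is_stale(last_write, now,
                    interval=timedelta(minutes=30),
                    slack=timedelta(minutes=5)):
    """True if the newest backup object is older than interval + slack.

    interval matches the 30 min S3 sync schedule; slack (an assumed
    value) allows for the time a sync takes to complete.
    """
    return now - last_write > interval + slack
```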
Sumologic
-
ES error logs
-
ES gossip
-
ES stats
-
Automated test results
-
ES users/connections
-
Backup failure logs
Alerts
Alerts for the following conditions to be set up;
-
not exactly 1 master in cluster
-
not at least 1 slave
(for 5-node cluster actually want at least 3 slaves)
-
actual cluster size < configured cluster size for longer than 1 hour
-
configured cluster size != actual cluster size / asg size
-
cluster health GET /gossip fails/stops producing monitoring data
-
node status flip/flopping for any node
-
storagereaderqueues backing up and not draining
-
any node status is not one of master, clone, slave and node is > 30 min old
-
ntp diff between any node > 30 sec
-
writes/second > 4K
-
reads/second > 20K
-
restore.elapsedtime > 30 min
-
backup.elapsedtime > 1 hour
-
avail disk space < 30% total disk space
-
no nodes are < 2 days old (would indicate nodes not being refreshed from daily backup)
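Several of the cluster-state alerts above can be derived from a single /gossip snapshot. The sketch below assumes the gossip JSON has a "members" list whose entries carry "state" and "isAlive" fields; the alert names are hypothetical. It ignores the time-based qualifiers (1 hour, 30 min), which would need state kept between polls.

```python
HEALTHY_STATES = {"master", "slave", "clone"}


def evaluate_gossip(gossip, configured_size):
    """Return a sorted list of alert names for one /gossip snapshot."""
    # Only count members the cluster currently believes are alive.
    members = [m for m in gossip.get("members", [])
               if m.get("isAlive", True)]
    states = [str(m.get("state", "")).lower() for m in members]
    alerts = set()
    if states.count("master") != 1:
        alerts.add("master_count_not_one")
    if states.count("slave") < 1:
        alerts.add("no_slave")
    if len(members) < configured_size:
        alerts.add("cluster_undersized")
    if any(s not in HEALTHY_STATES for s in states):
        alerts.add("node_in_transitional_state")
    return sorted(alerts)
```

For example, a snapshot with two masters and one node still catching up would raise the master count, slave and transitional state alerts together.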
Chaos Monkey
Purpose of chaos monkey is to randomly kill a node and verify that automated recovery/restore/rejoining the cluster/catchup/monitoring is working
-
Update Chaos monkey to find and kill an ES node no more than once/day. Can be any node, ie do not exclude master.
-
Do not execute chaos if actual cluster size < configured cluster size
-
Do not execute chaos if any nodes not in healthy state
-
Provide a semi-automated way to manually disable chaos invocations
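The chaos guard rules above could be combined into a single pre-flight check, roughly as follows. This is a sketch: the function name and reason strings are hypothetical, "healthy" is assumed to mean a status of master, slave or clone, and the `disabled` flag stands in for whatever manual switch is chosen.

```python
from datetime import datetime, timedelta

HEALTHY_STATES = {"master", "slave", "clone"}


def may_run_chaos(states, configured_size, last_hit, now, disabled=False):
    """Return (allowed, reason) for one chaos invocation.

    states: node statuses currently reported by the cluster.
    last_hit: datetime of the previous termination, or None.
    """
    if disabled:
        return False, "chaos manually disabled"
    if len(states) < configured_size:
        return False, "cluster below configured size"
    if any(s.lower() not in HEALTHY_STATES for s in states):
        return False, "unhealthy node present"
    if last_hit is not None and now - last_hit < timedelta(days=1):
        return False, "already terminated a node in the last day"
    return True, "ok"
```

The master is deliberately not excluded: any node in `states` remains a valid target once the check passes.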