Slow slave catchup following write flood with large events

Chris_Elgin · June 13, 2017, 9:54pm

I’m doing some performance testing on a new cluster in Azure to determine the best configuration (# disks, CPU, memory, stripe size, # NICs, etc). I’m using Phil Bolduc’s TestClientAPI (thanks Phil!) to test write floods, and I’m observing a behavior I can’t explain. When I execute a write flood of relatively large events (in this case, 100k events across 1000 streams on 10 clients, each event containing 3430 chars), the write flood finishes after about 14 seconds, but the slave nodes take 2-3 minutes to catch up. It appears this catchup happens at around 10 Mb/s per node, which seems incredibly slow, as I can read all the events via a catchup subscription from a client at least 30x as fast. There doesn’t appear to be any significant or limiting activity on the master or the slave nodes: CPU, memory, and disk usage are all low.
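As a rough sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes one byte per character and ignores per-event metadata and framing, so the real figures will be somewhat higher; the variable names are just for illustration.

```python
# Back-of-the-envelope figures for the flood described above.
# Assumption: 1 byte per character, no per-event metadata or framing overhead.
EVENTS = 100_000
CHARS_PER_EVENT = 3430

payload_bytes = EVENTS * CHARS_PER_EVENT  # 343,000,000 bytes


def rate_mb_per_s(seconds):
    """Average payload throughput in megabytes per second over `seconds`."""
    return payload_bytes / seconds / 1_000_000


write_rate = rate_mb_per_s(14)      # flood duration as reported by the client
catchup_fast = rate_mb_per_s(120)   # slaves caught up after 2 minutes
catchup_slow = rate_mb_per_s(180)   # slaves caught up after 3 minutes

print(f"payload: {payload_bytes / 1_000_000:.0f} MB")
print(f"flood:   {write_rate:.1f} MB/s")
print(f"catchup: {catchup_slow:.1f} to {catchup_fast:.1f} MB/s")
```

Under these assumptions the reported flood rate is roughly ten times the apparent catchup rate, which is the gap I can’t account for.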

One thing I noticed in the logs: I’m seeing 5-6k IDEMPOTENT WRITE TO STREAM messages when I do a flood with 256-char events, and 20k when I do the flood with 3430-char events. No errors are logged.

Anyway, I imagine this is just a nuance of ES mechanics I don’t yet understand, but I would appreciate some guidance on what I’m seeing.

Specs:

4 VMs running in the same subnet, all of spec Standard_DS3_V2 - 3 database nodes, and one console node that executes the tests
Each database node running 4TB data disk (4 1TB P30 disks striped)
Running 3.9.4 at the moment

Thanks,

Chris

Greg_Young1 · June 13, 2017, 9:57pm

This is not possible, as at least one slave must ack in a cluster of 3
nodes before the write is acknowledged. Likely the test is broken or you
are misunderstanding the results.
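
To make the "at least one slave" figure concrete, here is a minimal sketch. It assumes a write needs a simple majority of the cluster and that the master counts as one member of that majority; the function name is hypothetical, not part of any Event Store API.

```python
def slave_acks_required(cluster_size: int) -> int:
    """Slave acks needed before a write can be acknowledged to the client.

    Assumption: a write needs a simple majority of nodes, and the master
    counts as one member of that majority.
    """
    quorum = cluster_size // 2 + 1
    return quorum - 1  # the master already holds the write


print(slave_acks_required(3))  # prints 1
```

So with 3 nodes, a write that has really been acknowledged is already on at least one slave, and a slave cannot be minutes behind on acknowledged writes.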

Chris_Elgin · June 14, 2017, 6:12pm

Now that I’ve discovered the “official” EventStore.TestClient.exe, I can confirm WRFL behaves as expected, as the slaves are very close behind master at all times. I’ll do some digging to try and figure out where this other test harness goes wrong.

Thanks!

Greg_Young1 · June 14, 2017, 6:25pm

My guess is the other test just does unbounded writes, reporting their
start, not their completion.
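
A minimal sketch of that failure mode, using plain asyncio rather than any Event Store client. `fake_write`, `WRITE_LATENCY` and `NUM_WRITES` are made up for illustration, and the sleep stands in for waiting on the server's ack.

```python
import asyncio
import time

WRITE_LATENCY = 0.05  # hypothetical per-write round trip, in seconds
NUM_WRITES = 200      # hypothetical number of writes in the flood


async def fake_write(i: int) -> int:
    """Stand-in for a write; a real client would await the server's ack here."""
    await asyncio.sleep(WRITE_LATENCY)
    return i


async def run_flood():
    start = time.perf_counter()
    # Start every write without waiting on any of them: the "unbounded" part.
    tasks = [asyncio.create_task(fake_write(i)) for i in range(NUM_WRITES)]
    dispatched = time.perf_counter() - start  # what a broken harness would report
    await asyncio.gather(*tasks)              # wait until every write has finished
    completed = time.perf_counter() - start   # what should be reported
    return dispatched, completed


dispatched, completed = asyncio.run(run_flood())
print(f"all writes started after   {dispatched:.4f}s")
print(f"all writes completed after {completed:.4f}s")
```

If a harness stops its timer at the first measurement, the flood looks finished while most of the writes are still in flight, and the cluster appears to be "catching up" afterwards.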