I’m doing some performance testing on a new cluster in Azure to determine the best configuration (# disks, CPU, memory, stripe size, # NICs, etc.). I’m using Phil Bolduc’s TestClientAPI (thanks Phil!) to test write floods, and I’m observing a behavior I can’t explain. When I execute a write flood that writes a number of relatively large events (in this case, 100k events across 1000 streams on 10 clients, each event containing 3430 chars), the write flood finishes after about 14 seconds, but the slave nodes take 2-3 minutes to catch up. It appears this catch-up happens at around 10 Mb/s per node, which seems incredibly slow, as I can read all the events via a catch-up subscription from a client at least 30x as fast. There doesn’t appear to be any significant or limiting activity on the master or the slave nodes: CPU, memory, and disk usage are all low.
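As a rough sanity check on those numbers, here is a back-of-envelope sketch of the payload rates involved. It assumes the 100k events are the total across all clients, that each char is one byte, and that there is no per-event metadata or framing overhead; none of these are measured values, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-envelope throughput estimate for the write flood described above.
# Assumptions (not measured): 100k events in total, 1 byte per char,
# no per-event metadata or framing overhead.

EVENT_COUNT = 100_000
EVENT_SIZE_BYTES = 3430
WRITE_SECONDS = 14
CATCHUP_SECONDS = (120, 180)  # "2-3 minutes"


def mb_per_second(total_bytes, seconds):
    """Return throughput in megabytes (10^6 bytes) per second."""
    return total_bytes / seconds / 1_000_000


total_bytes = EVENT_COUNT * EVENT_SIZE_BYTES
write_rate = mb_per_second(total_bytes, WRITE_SECONDS)
catchup_rates = [mb_per_second(total_bytes, s) for s in CATCHUP_SECONDS]

print(f"payload: {total_bytes / 1_000_000:.0f} MB")        # 343 MB
print(f"write flood rate: {write_rate:.1f} MB/s")          # 24.5 MB/s
print(f"implied catch-up rate: "
      f"{catchup_rates[1]:.1f}-{catchup_rates[0]:.1f} MB/s")  # 1.9-2.9 MB/s
```

Under these assumptions the master accepts roughly 24.5 MB/s of payload, while the slaves catch up at only a few MB/s, which is the gap I’m trying to understand.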
One note from the logs: I’m seeing 5-6k IDEMPOTENT WRITE TO STREAM messages when I do a flood with 256-char events, and 20k when I do the flood with 3430-char events. No errors are logged.
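For anyone who wants to reproduce the count, a minimal sketch of how such messages could be tallied is below. It simply counts lines containing the phrase; the log file path in the usage note is a placeholder, and the sketch makes no assumption about the log line format beyond the phrase appearing somewhere in the line.

```python
# Count log lines that contain a given phrase,
# e.g. "IDEMPOTENT WRITE TO STREAM".

def count_matching_lines(lines, phrase):
    """Return how many of the given lines contain the phrase."""
    return sum(1 for line in lines if phrase in line)


def count_in_file(path, phrase):
    """Count matching lines in a text log file (path supplied by the caller)."""
    with open(path, encoding="utf-8", errors="replace") as log_file:
        return count_matching_lines(log_file, phrase)
```

Usage would look like `count_in_file("path/to/node.log", "IDEMPOTENT WRITE TO STREAM")`, where `path/to/node.log` is a hypothetical path, run once per node and once per test run so the counts can be compared across event sizes.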
Anyway, I imagine this is just a nuance of ES mechanics I don’t yet understand, but I would appreciate some guidance on what I’m seeing.
Specs:
- 4 VMs running in the same subnet, all of spec Standard_DS3_V2: 3 database nodes, and one console node that executes the tests
- Each database node running a 4TB data disk (4 1TB P30 disks striped)
- Running 3.9.4 at the moment
Thanks,
Chris