Help diagnosing ES performance/reliability issues

We’re having a REALLY hard time troubleshooting an EventStore cluster performance issue.

Since we use Ruby, we’ve built our own HTTP client libraries, and the throughput is just terrible. But we’re having a hard time understanding exactly where the problem is.

In production we are running a 3-node cluster on SoftLayer dedicated hardware (4 cores, 8GB RAM, pretty standard HDD setups), with an HAProxy load balancer.

We are stuck at about 50 requests/sec in and out with about a 1:1 read/write ratio.

Read-only, we see about 100 requests/sec (this varies based on disk/ram caching).

Our dataset is about 50 GB.

We also get 500 errors out of EventStore pretty often (once every minute or two, sometimes on writes, sometimes on reads) - though the data seems to always be intact when it does successfully write.

We’ve gotten to the point where this is an unusable tool for us due to the reliability and throughput issues, and are seriously considering migrating away from this as our main event storage layer.

We’re putting together some tooling to try to benchmark this well, comparing Ruby clients, and I just spent a few hours trying to build some kind of corresponding setup using the .NET client since it seems that’s what everyone else is using, but got totally stumped by the vagaries of Mono/NuGet/OS X.

Is there some kind of sample/benchmarking tool that anyone has that can let us at least diagnose what our current hardware should be capable of in an ideal client world?

Thanks

Have you tried the TestClient? It has tests for both TCP and HTTP clients.

I JUST discovered it.

Trying to understand the `HELP` command. Do you have any sample commands you could share?

WRFL is a write flood (over TCP).
`WRFL <threads> <total requests>` is the most common usage.
For HTTP you can easily test with tools such as wrk.

I should add that our tests are one to two orders of magnitude off from yours, using things like wrk or ab.
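
For example, something along these lines (host and stream name here are just placeholders):

$ wrk -t4 -c20 -d30s -H "Accept: application/json" "http://127.0.0.1:2113/streams/test-stream"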

Alright, so I have that app figured out I think, and am getting some numbers from the production cluster (this is just reading to avoid dumping a bunch of junk data into the db). The numbers at first seemed ok, but then I seem to have broken my cluster.

I think the kernel is correct on these:

$ uname -r
3.13.0-100-generic


>>> RDFL 100 10000
[22101,07,20:52:34.493] Processing command: RDFL 100 10000.
##########fastest: 0
quintiles
20% : 0-40
40% : 40-64
60% : 64-94
80% : 94-153
100% : 153-1282
90% : 1270
95% : 1276
98% : 1279
99% : 1280
99.5% : 1281
99.9% : 1282
99.99% : 1282
99.999% : 1282
99.9999% : 1282
99.99999% : 1282
Highest : 1282
[22101,06,20:52:35.101] Completed. READS succ: 0, fail: 10000.
[22101,06,20:52:35.101] 10000 requests completed in 592ms (16891.89 reqs per sec).
[22101,06,20:52:35.106] [[begin
DataName:RDFL;
clientsCnt:100;requestsCnt:10000;ElapsedMilliseconds:592;
readsCnt:10000;
end]]
[22101,06,20:52:35.106]
##teamcity[buildStatisticValue key='RDFL-100-10000-reqPerSec' value='16891']
[22101,07,20:52:35.106] Command exited with code 0.

>>> RDFL 100 1000000
[22101,07,20:53:01.782] Processing command: RDFL 100 1000000.
#[22101,215,20:53:02.171] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
####[22101,216,20:53:02.747] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
##[22101,216,20:53:02.810] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
##########################################################################[22101,217,20:53:05.608] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
#########[22101,217,20:53:05.915] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
########[22101,217,20:53:06.160] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
[22101,197,20:53:06.166] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
#[ERROR] FATAL UNHANDLED EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object
at EventStore.TestClient.Commands.ReadFloodProcessor+<ReadFlood>c__AnonStorey1.<>m__2 () [0x00000] in <filename unknown>:0
at System.Threading.Thread.StartInternal () [0x00000] in <filename unknown>:0

>>> RDFL 100 100000
Segments count: 1, buffers count: 512, should be when full: 512
[22472,07,20:55:17.205] Processing command: RDFL 100 100000.
########################[22472,110,20:55:18.620] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
###[22472,112,20:55:18.725] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
##########################[22472,112,20:55:19.679] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
############################[22472,118,20:55:22.380] Global Unhandled Exception occurred.
Object reference not set to an instance of an object
###[22472,07,20:55:22.729] Error during execution of command: Connection was closed prematurely..
[22472,07,20:55:22.729] Command exited with code 1.
#>>> ###########[22472,119,20:55:23.182] Global Unhandled Exception occurred.
Object reference not set to an instance of an object

RDFL is a read flood. It also looks like you are on a kernel with a regression, and RDFL needs a WR command before being used. Normally we run way more than 1000 requests (more like 100m to look at things).

Right, I was just trying it with reads because I didn’t want to dump a bunch of diagnostic data (writes) into the cluster.

What does `it needs a wr command before being used` mean?

It reads the same event written by the WR command. Without doing that first, you are testing reads of non-existent events.
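
In other words, a session would look something like this (assuming WR's defaults are fine; `HELP` lists its arguments):

>>> WR
>>> RDFL 100 100000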

Alright,

So this is where I am at -

3-node cluster, dedicated physical hardware at SoftLayer with 1TB HDDs (not SSDs). 8GB RAM.

Single node read throughput of single events (of my “real” data) is capped at about 3k/sec (~ 3.5 MB/sec).

Even if I do the requests from the same node so any network traversal is eliminated, it’s about the same.

With the TCP client ([buildStatisticValue key='RDFL-20-200000-reqPerSec' value='80096']), I get more like 80k reqs/sec.

There seems to be some kind of "modal" issue with ES in that sometimes I'll see a pretty high (~1%) error rate, but I'm unable to reproduce that reliably at this point.

Below is my ES config, and the tests.

Can anyone suggest where I could look for glaring config/setup problems?

When you do RDFL I assume you did a WR before? Likely that read is cached, whereas your other test is likely not cached. Using a single spinning disk you are likely seeing seek thrashing when it hits the disk (which would affect reads and writes).
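
An easy way to check that would be to watch the disk while one of the floods runs, e.g.:

$ iostat -x 1

If %util sits near 100 with high await while throughput is low, the disk is the bottleneck.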

Yes, I’ve done WRs before and made sure I had a bunch of take-some-space events. Also, I was doing `RDFL` using the same stream that I was doing `wrk` on ($ce-externalAccountTransfer), and seeing 10-20x+ the performance of TCP vs HTTP. I understand that TCP will have better performance in general, but 2-3 MB/sec over HTTP vs the “saturating gigabit” described on this mailing list is a pretty big difference (and the latter is what we were expecting).

Oh, and similarly, if I `ab` on a single event (ab -k -c 20 -n 10000 -H "Accept: application/json" "http://10.130.25.204:2113/streams/\$ce-externalAccountTransfer/184016") the very best I can see is about 4k reads/sec.

Presumably that eliminates anything that may be related to disk-related caching...

Over HTTP it will be slower than TCP (it also has to do a ton more work, with things like serialization/formatting). 4k requests per second is around where I would expect it to be. What media type are you reading with? JSON/XML will actually parse/reformat the event payload IIRC, whereas octet-stream will not.

Yes, we use JSON.

I’ve seen statements suggesting we should expect much higher performance than this more than once:

https://groups.google.com/d/msg/event-store/NJ2stBRD1jA/l3Uif-tQLpIJ

with a load balancer in front (clients do not understand cluster) was hitting about 10000 writes per second and saturating gigabit for reads over the [HTTP] atom interface. This blew away any other provider.

Is this a misstatement, or has there been a performance regression since Jan 2014?

That's balanced to 3 nodes. There has been one change associated though that may have some effect: when nodes forward requests, it used to be done over TCP but had to be moved to HTTP for a few reasons.

That said, why are you so worried about the HTTP performance? In most use cases you would want to put nginx etc. in front of it, as basically every URI it generates (except for the head) is infinitely cacheable; doing all the reformatting on every request is pretty wasteful.
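
You can see that by checking the headers on any canonical (non-head) feed page, e.g. (using the host/stream from your ab runs):

$ curl -I -H "Accept: application/json" "http://10.130.25.204:2113/streams/\$ce-externalAccountTransfer/0/forward/20"

A page like that should come back with a long max-age Cache-Control header, so a proxy can serve repeat reads without ES re-rendering them; only the head of the feed changes.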

Ok thanks. With octet-stream (ab -k -c 20 -n 10000 -H "Accept: application/octet-stream" "http://10.130.25.204:2113/streams/\$ce-externalAccountTransfer/184016") I see pretty similar performance (~4k req/sec), and so much variance I can’t really be sure if it’s significant, so that doesn’t contribute much.

We’ve explored using Varnish as a countermeasure to this, and have tested it some. It adds some complexity, and we were trying to understand the system better before adding more. Additionally, a cache countermeasure like this is susceptible to the scenario where, if we are depending on it for system usability and for some reason we lose the cache, the system is now faulted.

We’re looking at a setup like this, with haproxy doing url-based balancing to maximize caching:

            ---- Varnish -> ES
           /
haproxy ------- Varnish -> ES
           \
            ---- Varnish -> ES

Though there is even more complexity, since we’re running multiple instances of ES on each node to try to simulate some kind of data “namespace”.

Our understanding was that once we switched to dedicated hardware we would be able to get “saturating gigabit” levels of performance out of this adapter. We’re seeing pretty crippling overall system performance, and are trying to learn as much as we can about what’s contributing; hence this research into the actual levels of HTTP performance. It seems we were off by quite a bit in what we could expect.

What are you doing that you have 4000 http reqs/sec? And with 3 nodes
you should be able to do this per node (on reads). It seems you are
measuring from one?

It would also be interesting to see the relationship between
connections / performance as 10 is relatively low.
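
For example, a quick sweep (host and stream are placeholders for whatever you were hitting):

# read the same URI at increasing connection counts and keep only the throughput line
for c in 8 16 32 64 128; do
  echo "connections=$c"
  wrk -t4 -c "$c" -d15s -H "Accept: application/json" "http://127.0.0.1:2113/streams/test-stream" | grep Requests/sec
done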

On a side note I generally prefer wrk to ab

What are you doing that you have 4000 http reqs/sec?

Characterizing the HTTP performance. =)

And with 3 nodes you should be able to do this per node (on reads). It seems you are
measuring from one?

Correct, with the LB I get about 3x the performance.

It would also be interesting to see the relationship between
connections / performance as 10 is relatively low.

The tests were mostly done at 20+ connections.

The application performance issues we’re having revolve around either:

- one of our system startups that ingests a bunch of streams (reads the $ce-account stream, for example), because it has somewhat primitive entity projections that don’t "snapshot" well (we’re transitioning to better read models; see next point)
- processing our event backlog as we bring online a better read model (the system picks up an event, projects several streams related to that event, and writes a database row)

On a side note I generally prefer wrk to ab

:+1: