Eventstore getting excessively slow

rasmus · April 6, 2018, 9:29am

Hi,

We are seeing log output like this:

SLOW QUEUE MSG [Worker #2]: AuthenticatedHttpRequestMessage - 1662ms. Q: 0/1.

We use a single eventstore to drive the communication in our system (see configuration below), so this made our application completely unresponsive. Restarting the eventstore allowed it to chew through its backlog, and everything is still working several hours after the restart.

Obviously, we would like to avoid this in the future. What are ways to combat the slow queue? We are using projections; could they be the cause? How can we measure how many messages it is processing and how quickly, and what load should we expect it to handle without problems?

Config part of log:

ES VERSION: 3.9.4.0 (HEAD/bbcfd2d092a76e8faac4837b1bcdda72b467713b, Wed, 26 Apr 2017 14:09:22 +0100)

OS: Linux (Unix 4.4.111.0)

RUNTIME: 3.12.1 (es-mono-3.12.1/4493dfd) (64-bit)

GC: 2 GENERATIONS

LOGS: /var/log/eventstore

MODIFIED OPTIONS:

INT IP: 10.0.19.37 (Environment Variable)

EXT IP: 0.0.0.0 (Environment Variable)

INT HTTP PORT: 2112 (Environment Variable)

EXT HTTP PORT: 2113 (Environment Variable)

CLUSTER SIZE: 1 (Environment Variable)

CLUSTER DNS: eventstore (Environment Variable)

CLUSTER GOSSIP PORT: 2112 (Environment Variable)

INT HTTP PREFIXES: http://*:2112/ (Environment Variable)

EXT HTTP PREFIXES: http://*:2113/ (Environment Variable)

GOSSIP ALLOWED DIFFERENCE MS: 600000 (Environment Variable)

ADD INTERFACE PREFIXES: false (Config File)

RUN PROJECTIONS: All (Config File)

DEFAULT OPTIONS:

CONFIG: /etc/eventstore/eventstore.conf ()

HELP: False ()

VERSION: False ()

LOG: /var/log/eventstore ()

DEFINES: ()

WHAT IF: False ()

START STANDARD PROJECTIONS: False ()

DISABLE HTTP CACHING: False ()

MONO MIN THREADPOOL SIZE: 10 ()

INT TCP PORT: 1112 ()

INT SECURE TCP PORT: 0 ()

EXT TCP PORT: 1113 ()

EXT SECURE TCP PORT ADVERTISE AS: 0 ()

EXT SECURE TCP PORT: 0 ()

EXT IP ADVERTISE AS: ()

EXT TCP PORT ADVERTISE AS: 0 ()

EXT HTTP PORT ADVERTISE AS: 0 ()

INT IP ADVERTISE AS: ()

INT SECURE TCP PORT ADVERTISE AS: 0 ()

INT TCP PORT ADVERTISE AS: 0 ()

INT HTTP PORT ADVERTISE AS: 0 ()

INT TCP HEARTBEAT TIMEOUT: 700 ()

EXT TCP HEARTBEAT TIMEOUT: 1000 ()

INT TCP HEARTBEAT INTERVAL: 700 ()

EXT TCP HEARTBEAT INTERVAL: 2000 ()

FORCE: False ()

NODE PRIORITY: 0 ()

MIN FLUSH DELAY MS: 2 ()

COMMIT COUNT: -1 ()

PREPARE COUNT: -1 ()

ADMIN ON EXT: True ()

STATS ON EXT: True ()

GOSSIP ON EXT: True ()

DISABLE SCAVENGE MERGING: False ()

SCAVENGE HISTORY MAX AGE: 30 ()

DISCOVER VIA DNS: True ()

GOSSIP SEED: ()

STATS PERIOD SEC: 30 ()

CACHED CHUNKS: -1 ()

READER THREADS COUNT: 4 ()

CHUNKS CACHE SIZE: 536871424 ()

MAX MEM TABLE SIZE: 1000000 ()

HASH COLLISION READ LIMIT: 100 ()

DB: /var/lib/eventstore ()

INDEX: ()

MEM DB: False ()

SKIP DB VERIFY: False ()

PROJECTION THREADS: 3 ()

WORKER THREADS: 5 ()

ENABLE TRUSTED AUTH: False ()

CERTIFICATE STORE LOCATION: ()

CERTIFICATE STORE NAME: ()

CERTIFICATE SUBJECT NAME: ()

CERTIFICATE THUMBPRINT: ()

CERTIFICATE FILE: ()

CERTIFICATE PASSWORD: ()

USE INTERNAL SSL: False ()

DISABLE INSECURE TCP: False ()

SSL TARGET HOST: n/a ()

SSL VALIDATE SERVER: True ()

AUTHENTICATION TYPE: internal ()

AUTHENTICATION CONFIG: ()

PREPARE TIMEOUT MS: 2000 ()

COMMIT TIMEOUT MS: 2000 ()

UNSAFE DISABLE FLUSH TO DISK: False ()

BETTER ORDERING: False ()

UNSAFE IGNORE HARD DELETE: False ()

INDEX CACHE DEPTH: 16 ()

GOSSIP INTERVAL MS: 1000 ()

GOSSIP TIMEOUT MS: 500 ()

ENABLE HISTOGRAMS: False ()

LOG HTTP REQUESTS: False ()

ALWAYS KEEP SCAVENGED: False ()
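One rough way to get a handle on how often the slow-queue warnings occur is to tally them from the log files. The sketch below is only an illustration: the regular expression is inferred from the single sample line quoted above, so other versions may format the message differently, and the meaning of the two numbers after `Q:` is not stated in this thread (they are kept as opaque fields). The names `parse_slow_line` and `summarize` are made up for this example.

```python
import re
from collections import defaultdict

# Pattern inferred from the one sample line in this thread (an assumption,
# not a documented format).
SLOW_MSG = re.compile(
    r"SLOW QUEUE MSG \[(?P<queue>[^\]]+)\]: "
    r"(?P<message>\S+) - (?P<ms>\d+)ms\. "
    r"Q: (?P<q_first>\d+)/(?P<q_second>\d+)\."
)


def parse_slow_line(line):
    """Return a dict for a SLOW QUEUE MSG line, or None if it does not match."""
    m = SLOW_MSG.search(line)
    if m is None:
        return None
    return {
        "queue": m.group("queue"),
        "message": m.group("message"),
        "ms": int(m.group("ms")),
        # The thread does not say what these two numbers mean, so they are
        # stored as-is without interpretation.
        "q_first": int(m.group("q_first")),
        "q_second": int(m.group("q_second")),
    }


def summarize(lines):
    """Group slow messages by (queue, message type): count and worst duration."""
    summary = defaultdict(lambda: {"count": 0, "max_ms": 0})
    for line in lines:
        rec = parse_slow_line(line)
        if rec is None:
            continue  # not a slow-queue line
        entry = summary[(rec["queue"], rec["message"])]
        entry["count"] += 1
        entry["max_ms"] = max(entry["max_ms"], rec["ms"])
    return dict(summary)
```

Passing an open log file (any iterable of lines) to `summarize` gives one entry per queue and message type, which should show whether the warnings come from one queue or are spread across several.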

rasmus · April 6, 2018, 9:50am

I should perhaps mention that we are only using the standard projections, by category and by type.

Austin_Salgat · April 10, 2018, 11:00pm

What hardware are you running on: CPU, drive type, and amount of RAM? The biggest bottlenecks for EventStore are, first, your drive type (since EventStore may do a lot of random access on the drive) and, second, your memory (EventStore uses memory mapping to cache as much of the database in memory as possible).

rasmus · April 11, 2018, 8:25am

Hi Austin, thank you for the reply.

CPU and RAM:

We are running on Google Cloud. The eventstore is assigned to a machine of type n1-standard-1, a “Standard machine type with 1 virtual CPU and 3.75 GB of memory”.

(https://cloud.google.com/compute/docs/machine-types, they have this footnote: For the n1 series of machine types, a virtual CPU is implemented as a single hardware hyper-thread on a 2.6 GHz Intel Xeon E5 (Sandy Bridge), 2.5 GHz Intel Xeon E5 v2 (Ivy Bridge), 2.3 GHz Intel Xeon E5 v3 (Haswell), 2.2 GHz Intel Xeon E5 v4 (Broadwell), or 2.0 GHz Intel Skylake (Skylake).)

We have tried an n1-highcpu-4, a “High-CPU machine type with 4 virtual CPUs and 3.60 GB of memory”, which seems to help somewhat.

Drive:

The eventstore is writing to a persistent volume of type pd-ssd. I believe that translates to an SSD persistent disk as described here: https://cloud.google.com/compute/docs/disks/#pdspecs

Austin_Salgat · April 12, 2018, 12:56am

My best suggestion is to move to an instance with more memory. For us this upgrade represented one of our biggest jumps in performance and stability. I would start with n1-standard-2, since that doubles both your virtual cores (which helps with projections) and your memory, giving the OS a lot more caching ability (I would even consider n1-highmem-2).
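To put numbers on the machine types mentioned in this thread, here is a small comparison against the current n1-standard-1. The n1-standard-1 and n1-highcpu-4 figures are the ones quoted earlier in the thread; the n1-standard-2 figures follow from "doubles both" above; the n1-highmem-2 figures (2 vCPUs, 13 GB) are my recollection of the Google Cloud machine-types page and should be checked there. The helper name `relative_to` is just for this example.

```python
# (vCPUs, memory in GB) per machine type.
MACHINE_TYPES = {
    "n1-standard-1": (1, 3.75),   # quoted in this thread
    "n1-highcpu-4": (4, 3.60),    # quoted in this thread
    "n1-standard-2": (2, 7.50),   # "doubles both" relative to n1-standard-1
    "n1-highmem-2": (2, 13.00),   # assumption: verify against the GCP docs
}


def relative_to(baseline, candidate, table=MACHINE_TYPES):
    """Return (vCPU ratio, memory ratio) of candidate versus baseline."""
    base_cpu, base_mem = table[baseline]
    cpu, mem = table[candidate]
    return cpu / base_cpu, mem / base_mem


for name in MACHINE_TYPES:
    cpu_x, mem_x = relative_to("n1-standard-1", name)
    print(f"{name:15} {cpu_x:.2f}x vCPU  {mem_x:.2f}x memory")
```

The point of the comparison is that n1-highcpu-4 adds cores but has slightly less memory than the original machine, so it would not give the OS any extra room for caching.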

Greg_Young1 · April 12, 2018, 5:59am

Networked disks will always cause occasional issues (no matter which platform you are on, TCP sockets occasionally break, etc.). How often are you seeing this? Also, running on a larger node would likely solve some issues.

rasmus · April 13, 2018, 3:01pm

Thanks, both of you. Running on a bigger node does indeed seem to alleviate the SLOW messages. However, messages are still not coming through; it looks like subscriptions may have stopped. I will open a separate ticket for this.