EventStore persistent subscription performance issue

Hello,

We are seeing a pretty significant delay (sometime up to 14 seconds) in the events delivery for the persistent subscriptions even for a not a big load (~200 events/s).
We are running EventStore in the clustered mode (3 instances) in the Kubernetes (on-prem and AKS), subscribing to the categorized system projection and process each event concurrently. Delay is calculated by the subscriber as a time difference between Created timestamp in the event and the current date – without any load this delay is around 60-100 ms, but if there are a lot of events have been processed prior the load test (600k+) and we start generating ~200 events/s to the subscriber, delay increases exponentially. Event Store UI shows rate ~2000 events/s for the category projection

During the test event store log is full of the following warnings:

[00001,34,14:37:59.984] SLOW QUEUE MSG ["Projection Core #0"]: "UnwrapEnvelopeMessage" - 70ms. Q: 0/252.
[00001,135,14:38:00.222] SLOW QUEUE MSG ["StorageReaderQueue #3"]: "ReadStreamEventsForward" - 303ms. Q: 10/41.
[00001,34,14:38:00.503] SLOW BUS MSG ["bus"]: "UnwrapEnvelopeMessage" - 72ms. Handler: "UnwrapEnvelopeHandler".
[00001,34,14:38:00.503] SLOW QUEUE MSG ["Projection Core #0"]: "UnwrapEnvelopeMessage" - 72ms. Q: 0/253.

Could you please suggest what we might be doing wrong ?
Also do you have any best practices for the server configuration (number of threads, timeouts etc)?

Best regards,
Kirill

Current server configuration:

[00014,01,13:15:24.738] “OS:” Linux (Unix 4.15.0.1077)
[00014,01,13:15:24.762] “RUNTIME:” “5.16.0.220 (tarball Mon Nov 26 17:22:46 UTC 2018)” (64-bit)
[00014,01,13:15:24.764] “GC:” “2 GENERATIONS”
[00014,01,13:15:24.764] “LOGS:” “/var/log/eventstore”
MODIFIED OPTIONS:

    START STANDARD PROJECTIONS: true (Environment Variable)
    INT IP:                   10.244.0.96 (Environment Variable)
    EXT IP:                   10.244.0.96 (Environment Variable)
    INT HTTP PORT:            2114 (Environment Variable)
    EXT HTTP PORT:            2113 (Environment Variable)
    INT TCP PORT:             1112 (Environment Variable)
    EXT TCP PORT:             1113 (Environment Variable)
    INT TCP HEARTBEAT TIMEOUT: 10000 (Environment Variable)
    EXT TCP HEARTBEAT TIMEOUT: 10000 (Environment Variable)
    INT TCP HEARTBEAT INTERVAL: 5000 (Environment Variable)
    EXT TCP HEARTBEAT INTERVAL: 5000 (Environment Variable)
    CLUSTER SIZE:             1 (Environment Variable)
    CLUSTER DNS:              eventstore.default.svc.cluster.local (Environment Variable)
    CLUSTER GOSSIP PORT:      2114 (Environment Variable)
    RUN PROJECTIONS:          All (Environment Variable)
    INT HTTP PREFIXES:        http://*:2114/ (Environment Variable)
    EXT HTTP PREFIXES:        http://*:2113/ (Environment Variable)
    GOSSIP ALLOWED DIFFERENCE MS: 600000 (Environment Variable)
    ADD INTERFACE PREFIXES:   false (Config File)

DEFAULT OPTIONS:

    CONFIG:                   /etc/eventstore/eventstore.conf (<DEFAULT>)
    HELP:                     False (<DEFAULT>)
    VERSION:                  False (<DEFAULT>)
    LOG:                      /var/log/eventstore (<DEFAULT>)
    DEFINES:                  <empty> (<DEFAULT>)
    WHAT IF:                  False (<DEFAULT>)
    DISABLE HTTP CACHING:     False (<DEFAULT>)
    MONO MIN THREADPOOL SIZE: 10 (<DEFAULT>)
    INT SECURE TCP PORT:      0 (<DEFAULT>)
    EXT SECURE TCP PORT ADVERTISE AS: 0 (<DEFAULT>)
    EXT SECURE TCP PORT:      0 (<DEFAULT>)
    EXT IP ADVERTISE AS:      <empty> (<DEFAULT>)
    EXT TCP PORT ADVERTISE AS: 0 (<DEFAULT>)
    EXT HTTP PORT ADVERTISE AS: 0 (<DEFAULT>)
    INT IP ADVERTISE AS:      <empty> (<DEFAULT>)
    INT SECURE TCP PORT ADVERTISE AS: 0 (<DEFAULT>)
    INT TCP PORT ADVERTISE AS: 0 (<DEFAULT>)
    INT HTTP PORT ADVERTISE AS: 0 (<DEFAULT>)
    GOSSIP ON SINGLE NODE:    False (<DEFAULT>)
    CONNECTION PENDING SEND BYTES THRESHOLD: 10485760 (<DEFAULT>)
    CONNECTION QUEUE SIZE THRESHOLD: 50000 (<DEFAULT>)
    FORCE:                    False (<DEFAULT>)
    NODE PRIORITY:            0 (<DEFAULT>)
    MIN FLUSH DELAY MS:       2 (<DEFAULT>)
    COMMIT COUNT:             -1 (<DEFAULT>)
    PREPARE COUNT:            -1 (<DEFAULT>)
    ADMIN ON EXT:             True (<DEFAULT>)
    STATS ON EXT:             True (<DEFAULT>)
    GOSSIP ON EXT:            True (<DEFAULT>)
    DISABLE SCAVENGE MERGING: False (<DEFAULT>)
    SCAVENGE HISTORY MAX AGE: 30 (<DEFAULT>)
    DISCOVER VIA DNS:         True (<DEFAULT>)
    GOSSIP SEED:              <empty> (<DEFAULT>)
    STATS PERIOD SEC:         30 (<DEFAULT>)
    CACHED CHUNKS:            -1 (<DEFAULT>)
    READER THREADS COUNT:     4 (<DEFAULT>)
    CHUNKS CACHE SIZE:        536871424 (<DEFAULT>)
    MAX MEM TABLE SIZE:       1000000 (<DEFAULT>)
    HASH COLLISION READ LIMIT: 100 (<DEFAULT>)
    DB:                       /var/lib/eventstore (<DEFAULT>)
    INDEX:                    <empty> (<DEFAULT>)
    MEM DB:                   False (<DEFAULT>)
    SKIP DB VERIFY:           False (<DEFAULT>)
    WRITE THROUGH:            False (<DEFAULT>)
    UNBUFFERED:               False (<DEFAULT>)
    CHUNK INITIAL READER COUNT: 5 (<DEFAULT>)
    PROJECTION THREADS:       3 (<DEFAULT>)
    WORKER THREADS:           5 (<DEFAULT>)
    PROJECTIONS QUERY EXPIRY: 5 (<DEFAULT>)
    FAULT OUT OF ORDER PROJECTIONS: False (<DEFAULT>)
    ENABLE TRUSTED AUTH:      False (<DEFAULT>)
    CERTIFICATE STORE LOCATION: <empty> (<DEFAULT>)
    CERTIFICATE STORE NAME:   <empty> (<DEFAULT>)
    CERTIFICATE SUBJECT NAME: <empty> (<DEFAULT>)
    CERTIFICATE THUMBPRINT:   <empty> (<DEFAULT>)
    CERTIFICATE FILE:         <empty> (<DEFAULT>)
    CERTIFICATE PASSWORD:     <empty> (<DEFAULT>)
    USE INTERNAL SSL:         False (<DEFAULT>)
    DISABLE INSECURE TCP:     False (<DEFAULT>)
    SSL TARGET HOST:          n/a (<DEFAULT>)
    SSL VALIDATE SERVER:      True (<DEFAULT>)
    AUTHENTICATION TYPE:      internal (<DEFAULT>)
    AUTHENTICATION CONFIG:    <empty> (<DEFAULT>)
    DISABLE FIRST LEVEL HTTP AUTHORIZATION: False (<DEFAULT>)
    PREPARE TIMEOUT MS:       2000 (<DEFAULT>)
    COMMIT TIMEOUT MS:        2000 (<DEFAULT>)
    UNSAFE DISABLE FLUSH TO DISK: False (<DEFAULT>)
    BETTER ORDERING:          False (<DEFAULT>)
    UNSAFE IGNORE HARD DELETE: False (<DEFAULT>)
    SKIP INDEX VERIFY:        False (<DEFAULT>)
    INDEX CACHE DEPTH:        16 (<DEFAULT>)
    OPTIMIZE INDEX MERGE:     False (<DEFAULT>)
    GOSSIP INTERVAL MS:       1000 (<DEFAULT>)
    GOSSIP TIMEOUT MS:        500 (<DEFAULT>)
    ENABLE HISTOGRAMS:        False (<DEFAULT>)
    LOG HTTP REQUESTS:        False (<DEFAULT>)
    LOG FAILED AUTHENTICATION ATTEMPTS: False (<DEFAULT>)
    ALWAYS KEEP SCAVENGED:    False (<DEFAULT>)
    SKIP INDEX SCAN ON READS: False (<DEFAULT>)
    REDUCE FILE CACHE PRESSURE: False (<DEFAULT>)
    INITIALIZATION THREADS:   1 (<DEFAULT>)
    STRUCTURED LOG:           True (<DEFAULT>)
    MAX AUTO MERGE INDEX LEVEL: 2147483647 (<DEFAULT>)
    WRITE STATS TO DB:        True (<DEFAULT>)

This isn’t really going to help.

So the most likely thing to cause such a delay is series of retries. You have not included the configuration of the subscription or logs. It could also be a connection drop, slow handler, or 50 other things :slight_smile: The client can put out logs which will give greater detail (verbose logging), these would help in looking at things.