"---! VERY SLOW BUS MSG" and "DeadlineExceeded"

Hi community!

I would like to get some help from the experts here. So we have a 5 node cluster deployed in GCP (no K8S, just plain old VM instances)

The cluster nodes each run on 16 vCPUs and 16GB E2 machines.

Anyway, we have a client application (written in golang, using the official ESDB client) which is importing a moderately large amount of events into ESDB. This client is sending the events with about 500RPS.

What we noticed is that after a few hours, the leader node starts using a fairly large and unexpected size of the allocated memory, which leads to a few symptoms:

  • The client starts receiving
    [ErrorCodeDeadlineExceeded] gRPC deadline exceeded error: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
    errors
  • The server starts logging a few
    SLOW BUS MSG ["bus"]: "CommittedEventDistributed" - 30537ms. Handler: "EventReaderCoreService".
    errors, then later even a few
    ---!!! VERY SLOW BUS MSG ["bus"]: "CommittedEventDistributed" - 30537ms. Handler: "EventReaderCoreService".
    errors and then the nodes start the re-election process and the leader reboots itself

Also, we use ESDB to run a user-defined projection. We have millions of streams, which all send their events to this projection, which after completing it’s job emitting another event to result streams (so the input and output streams have a one-to-one correspondence). We also have 3 golang persistent subscriptions, which are connecting to the $all stream via a persistent group, and listening to the result streams by filtering the $all stream.

We noticed that when the re-election completes, and the leader re-boots itself, the projection also gets to reset.

Has anyone seen such behavior before? What can cause this? Not enough CPU power? Not enough disk IO throughput? Memory leak in ESDB?

In any case, here is a sample of the server logs (some messages are omitted):

199eunvf2ug8uo,"Nov 21 06:01:33 node3 bash[148776]: [148776,43,06:01:33.512,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 77ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vo,"Nov 21 06:01:33 node3 bash[148776]: [148776,19,06:01:33.982,INF] CLUSTER HAS CHANGED ""Elections Done""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vp,Nov 21 06:01:33 node3 bash[148776]: Old:,node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vq,"Nov 21 06:01:33 node3 bash[148776]: [""Priority: 0 VND {4a89ba97-fd00-440d-9928-fc66477ea5a6} <LIVE> [Follower, Unspecified/node4.esdb.prod.******.**********:1112, n/a, Unspecified/10.0.2.16:1113, n/a, Unspecified/10.0.2.16:2113, (ADVERTISED: HTTP::0, TCP::0), Version: 23.10.1.0] 104521893314/104521893839/104521893839/E154@104505008225:{28fb022a-737c-4546-b7dc-326ffedfaac6} | 2024-11-21 06:01:33.664"", ""Priority: 0 VND {7de5dff4-4b41-43d2-bd94-a85fe708535d} <LIVE> [Leader, Unspecified/node1.esdb.prod.******.**********:1112, n/a, Unspecified/10.0.2.15:1113, n/a, Unspecified/10.0.2.15:2113, (ADVERTISED: HTTP::0, TCP::0), Version: 23.10.1.0] 104522178323/104522198268/104522184998/E154@104505008225:{28fb022a-737c-4546-b7dc-326ffedfaac6} | 2024-11-21 06:01:33.958"", ""Priority: 0 VND {74264931-050d-4154-a70e-a11fe3b3c2e1} <LIVE> [Follower, Unspecified/node5.esdb.prod.******.**********:1112, n/a, Unspecified/10.0.2.14:1113, n/a, Unspecified/10.0.2.14:2113, (ADVERTISED: HTTP::0, TCP::0), Version: 23.10.1.0] 104518260444/104518264047/104518264047/E154@104505008225:{28fb022a-737c-4546-b7dc-326ffedfaac6} | 2024-11-21 06:01:32.405"", ""Priority: 0 VND {943a9344-f475-4261-93ab-657fa1b80236} <LIVE> [Leader, Unspecified/node3.esdb.prod.******.**********:1112, n/a, Unspecified/10.0.2.13:1113, n/a, Unspecified/10.0.2.13:2113, (ADVERTISED: HTTP::0, TCP::0), Version: 23.10.1.0] 104505007704/104505065009/104505065009/E152@98200119748:{bb7ec13f-2ff8-41eb-a6ce-85a8e80da6e0} | 2024-11-21 06:01:33.971"", ""Priority: 0 VND {739b64ea-15eb-4828-980a-e9f77a76525f} <LIVE> [Follower, Unspecified/node2.esdb.prod.******.**********:1112, n/a, Unspecified/10.0.2.12:1113, n/a, Unspecified/10.0.2.12:2113, (ADVERTISED: HTTP::0, TCP::0), Version: 23.10.1.0] 104517010901/104517011947/104517011947/E154@104505008225:{28fb022a-737c-4546-b7dc-326ffedfaac6} | 2024-11-21 06:01:30.420""]",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vr,Nov 21 06:01:33 node3 bash[148776]: New:,node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vs,"Nov 21 06:01:33 node3 bash[148776]: [""Priority: 0 VND {4a89ba97-fd00-440d-9928-fc66477ea5a6} <LIVE> [Unknown, Unspecified/node4.esdb.prod.******.**********:1112, n/a, Unspecified/10.0.2.16:1113, n/a, Unspecified/10.0.2.16:2113, (ADVERTISED: HTTP::0, TCP::0), Version: 23.10.1.0] 104521893314/104521893839/104521893839/E154@104505008225:{28fb022a-737c-4546-b7dc-326ffedfaac6} | 2024-11-21 06:01:33.982"", ""Priority: 0 VND {7de5dff4-4b41-43d2-bd94-a85fe708535d} <LIVE> [Leader, Unspecified/node1.esdb.prod.******.**********:1112, n/a, Unspecified/10.0.2.15:1113, n/a, Unspecified/10.0.2.15:2113, (ADVERTISED: HTTP::0, TCP::0), Version: 23.10.1.0] 104522178323/104522198268/104522184998/E154@104505008225:{28fb022a-737c-4546-b7dc-326ffedfaac6} | 2024-11-21 06:01:33.982"", ""Priority: 0 VND {74264931-050d-4154-a70e-a11fe3b3c2e1} <LIVE> [Unknown, Unspecified/node5.esdb.prod.******.**********:1112, n/a, Unspecified/10.0.2.14:1113, n/a, Unspecified/10.0.2.14:2113, (ADVERTISED: HTTP::0, TCP::0), Version: 23.10.1.0] 104518260444/104518264047/104518264047/E154@104505008225:{28fb022a-737c-4546-b7dc-326ffedfaac6} | 2024-11-21 06:01:33.982"", ""Priority: 0 VND {943a9344-f475-4261-93ab-657fa1b80236} <LIVE> [Unknown, Unspecified/node3.esdb.prod.******.**********:1112, n/a, Unspecified/10.0.2.13:1113, n/a, Unspecified/10.0.2.13:2113, (ADVERTISED: HTTP::0, TCP::0), Version: 23.10.1.0] 104505007704/104505065009/104505065009/E152@98200119748:{bb7ec13f-2ff8-41eb-a6ce-85a8e80da6e0} | 2024-11-21 06:01:33.982"", ""Priority: 0 VND {739b64ea-15eb-4828-980a-e9f77a76525f} <LIVE> [Unknown, Unspecified/node2.esdb.prod.******.**********:1112, n/a, Unspecified/10.0.2.12:1113, n/a, Unspecified/10.0.2.12:2113, (ADVERTISED: HTTP::0, TCP::0), Version: 23.10.1.0] 104517010901/104517011947/104517011947/E154@104505008225:{28fb022a-737c-4546-b7dc-326ffedfaac6} | 2024-11-21 06:01:33.982""]",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vt,"Nov 21 06:01:33 node3 bash[148776]: [148776,19,06:01:33.982,INF] ========== [""10.0.2.13:2113""] PRE-REPLICA STATE, WAITING FOR CHASER TO CATCH UP... LEADER IS [""Unspecified/10.0.2.15:2113"",{7de5dff4-4b41-43d2-bd94-a85fe708535d}]",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vu,"Nov 21 06:01:33 node3 bash[148776]: [148776,37,06:01:33.992,INF] Failed to write events to stream ""$et-replacement"". Error: ""CommitTimeout""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vv,"Nov 21 06:01:33 node3 bash[148776]: [148776,39,06:01:33.992,INF] Failed to write events to stream ""**************e2eResult-01932dd6-abaf-795e-9516-db0cecb0e13a"". Error: ""CommitTimeout""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vw,"Nov 21 06:01:33 node3 bash[148776]: [148776,26,06:01:33.992,INF] Failed to write events to stream ""$ce-**************e2eInput"". Error: ""CommitTimeout""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vx,"Nov 21 06:01:33 node3 bash[148776]: [148776, 8,06:01:33.992,INF] Failed to write projection checkpoint to stream ""$projections-$streams-checkpoint"". Error: ""CommitTimeout""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vy,"Nov 21 06:01:33 node3 bash[148776]: [148776,29,06:01:33.992,INF] Failed to write projection checkpoint to stream ""$projections-$stream_by_category-checkpoint"". Error: ""CommitTimeout""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8vz,"Nov 21 06:01:33 node3 bash[148776]: [148776,83,06:01:33.992,DBG] Error writing checkpoint for ""$persistentsubscription-$all::******-subscriber-group-checkpoint"": CommitTimeout",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8w0,"Nov 21 06:01:33 node3 bash[148776]: [148776,83,06:01:33.992,DBG] Persistent subscriptions received state change to PreReplica. Stopping listening",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8w7,"Nov 21 06:01:33 node3 bash[148776]: [148776,39,06:01:33.993,INF] Failed to write events to stream ""**************e2eResult-01932dd6-ee15-7ec2-8ec9-8c18b12ac45d"". Error: ""CommitTimeout""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8w8,"Nov 21 06:01:33 node3 bash[148776]: [148776,26,06:01:33.993,INF] Failed to write events to stream ""$ce-**************e2eResult"". Error: ""CommitTimeout""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8yy,"Nov 21 06:01:34 node3 bash[148776]: [148776,83,06:01:33.995,DBG] Persistent Subscriptions have been stopped.",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8yz,"Nov 21 06:01:34 node3 bash[148776]: [148776,28,06:01:34.000,INF] PROJECTIONS SUBSYSTEM: Stopping components for Instance: 9c1badae-8117-4149-a2e3-d5c189111b28",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8z5,"Nov 21 06:01:34 node3 bash[148776]: [148776,19,06:01:34.001,INF] ELECTIONS: (V=81) ACCEPT FROM [""Unspecified/10.0.2.14:2113"",{74264931-050d-4154-a70e-a11fe3b3c2e1}] M=[""Unspecified/10.0.2.15:2113"",{7de5dff4-4b41-43d2-bd94-a85fe708535d}]).",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8z6,"Nov 21 06:01:34 node3 bash[148776]: [148776,19,06:01:34.001,INF] ""InaugurationManager"" in state (PreReplica, Idle): IGNORING ""ChaserCaughtUp"". Not waiting for chaser.",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8z7,"Nov 21 06:01:34 node3 bash[148776]: [148776,19,06:01:34.001,INF] Subscribing at LogPosition: 104505065009 (0x1854FCBE31) to LEADER [""Unspecified/node1.esdb.prod.******.**********:1112"", {7de5dff4-4b41-43d2-bd94-a85fe708535d}] as replica with SubscriptionId: {f7ce7a11-ccad-4534-b741-31e46ccaa63f}, ConnectionId: {520e5f07-1049-437e-b877-c0d261639a3d}, LocalEndPoint: [""10.0.2.13:47464""], Epochs:",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8z8,"Nov 21 06:01:34 node3 bash[148776]: ""E152@98200119748:{bb7ec13f-2ff8-41eb-a6ce-85a8e80da6e0}\nE151@98199052087:{6376105e-a1a7-49f5-a9ba-551ed770d68d}\nE150@98199051413:{18894467-59f8-4e94-967b-919d578cf0ff}\nE149@98199050739:{a05afdcd-c41c-4b0d-af53-08bb21e51b88}\nE148@98199050065:{17bc31fa-9ed0-48e7-a4bb-d2af20e4296c}\nE147@98199049391:{4877e2b9-6cb7-49ea-90da-bb6f0f41fa83}\nE146@98192737876:{8682b5db-be04-43b9-8cc0-78de7129cd01}\nE144@90638199426:{31e214f7-ff68-4da4-9bf8-575f2b9e2b13}\nE143@90637865606:{12e5a554-b5ee-463b-aa21-01413596ddfd}\nE142@90637515215:{70cd4020-1d3f-4f01-9b7a-2f58513ae4c9}\nE141@90624286505:{f63e9a78-04cd-42df-8656-652aa5c98ba3}\nE136@84686393395:{1b848f60-ddfe-4e37-aa07-75b2ecd2f0a0}\nE135@84686220958:{460ea9c2-c57f-4c45-ae68-fa75dc1be371}\nE134@84686206920:{4b52a50c-dbea-4445-a778-81426e29c0af}\nE133@76797811139:{73bc6f65-7e68-48de-b035-9fb3fa5fd2f0}\nE132@76797810274:{940dca7f-8cfc-4322-85d4-45acdd987131}\nE129@75274979757:{da2fecb6-9180-4ad0-881f-de73ef0d2308}\nE128@75268727523:{83915836-f4d0-4c72-812c-69b7568cf464}\nE127@67575716195:{113636be-6d1a-4ae4-8cc6-8f19f00e0d4f}\nE126@67573703794:{abbf2e19-f45e-47dc-ad0a-5423cc557aee}""...",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
199eunvf2ug8z9,Nov 21 06:01:34 node3 bash[148776]: .,node3,projects/******-****/logs/syslog,,2024-11-21T06:01:34.069042838Z
1dxz9p5f7ig8nr,"Nov 21 06:01:01 node3 bash[148776]: [148776,37,06:01:01.325,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 50ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:02.055965375Z
1dxz9p5f7ig8ns,"Nov 21 06:01:01 node3 bash[148776]: [148776,26,06:01:01.377,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 48ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:02.055965375Z
1dxz9p5f7ig8nt,"Nov 21 06:01:01 node3 bash[148776]: [148776,43,06:01:01.379,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 50ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:02.055965375Z
1dxz9p5f7ig8nu,"Nov 21 06:01:01 node3 bash[148776]: [148776,44,06:01:01.516,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 53ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:02.055965375Z
1dxz9p5f7ig8nv,"Nov 21 06:01:01 node3 bash[148776]: [148776,47,06:01:01.517,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 51ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:02.055965375Z
1dxz9p5f7ig8nw,"Nov 21 06:01:01 node3 bash[148776]: [148776,42,06:01:01.524,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 50ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:02.055965375Z
1dz8r0af7kop1b,"Nov 21 06:01:32 node3 bash[148776]: [148776,46,06:01:32.329,DBG] SLOW BUS MSG [""bus""]: ""CommittedEventDistributed"" - 30536ms. Handler: ""EventReaderCoreService"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1c,"Nov 21 06:01:32 node3 bash[148776]: [148776,28,06:01:32.329,DBG] SLOW BUS MSG [""manager input bus""]: ""RegularTimeout"" - 30536ms. Handler: ""ProjectionCoreCoordinator"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1d,"Nov 21 06:01:32 node3 bash[148776]: [148776,46,06:01:32.329,ERR] ---!!! VERY SLOW BUS MSG [""bus""]: ""CommittedEventDistributed"" - 30536ms. Handler: ""EventReaderCoreService"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1e,"Nov 21 06:01:32 node3 bash[148776]: [148776,42,06:01:32.329,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 30539ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1f,"Nov 21 06:01:32 node3 bash[148776]: [148776,42,06:01:32.329,ERR] ---!!! VERY SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 30539ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1g,"Nov 21 06:01:32 node3 bash[148776]: [148776,26,06:01:32.329,DBG] SLOW BUS MSG [""bus""]: ""CommittedEventDistributed"" - 30537ms. Handler: ""EventReaderCoreService"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1h,"Nov 21 06:01:32 node3 bash[148776]: [148776,26,06:01:32.329,ERR] ---!!! VERY SLOW BUS MSG [""bus""]: ""CommittedEventDistributed"" - 30537ms. Handler: ""EventReaderCoreService"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1i,"Nov 21 06:01:32 node3 bash[148776]: [148776,28,06:01:32.329,ERR] ---!!! VERY SLOW BUS MSG [""manager input bus""]: ""RegularTimeout"" - 30536ms. Handler: ""ProjectionCoreCoordinator"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1j,"Nov 21 06:01:32 node3 bash[148776]: [148776,85,06:01:32.329,DBG] HTTP Request Received",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1k,Nov 21 06:01:32 node3 bash[148776]: 11/21/2024 06:01:32,node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1l,"Nov 21 06:01:32 node3 bash[148776]: From: ""10.0.2.15:39526""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1m,"Nov 21 06:01:32 node3 bash[148776]: ""GET"" http://10.0.2.13:2113/projection/**************Projectione2e/statistics",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1n,"Nov 21 06:01:32 node3 bash[148776]: {[""Host""]=[""10.0.2.13:2113""], [""User-Agent""]=[""Go-http-client/1.1""], [""Accept-Encoding""]=[""gzip""], [""traceparent""]=[""00-88bbb19cfd3482e40819a2bd869a1a40-36e91bb0f8c9cfd2-00""], [""X-Forwarded-Host""]=[""cluster.esdb.prod.******.**********:2113""]}",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1o,"Nov 21 06:01:32 node3 bash[148776]: """"",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1p,"Nov 21 06:01:32 node3 bash[148776]: [148776,48,06:01:32.331,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 30539ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1q,"Nov 21 06:01:32 node3 bash[148776]: [148776,48,06:01:32.331,ERR] ---!!! VERY SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 30539ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1r,Nov 21 06:01:32 node3 rsyslogd: action 'action-8-builtin:omfile' resumed (module 'builtin:omfile') [v8.2112.0 try https://www.rsyslog.com/e/2359 ],node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1s,"Nov 21 06:01:32 node3 bash[148776]: [148776,29,06:01:32.331,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 30539ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop1t,"Nov 21 06:01:32 node3 bash[148776]: [148776,29,06:01:32.331,ERR] ---!!! VERY SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 30539ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop26,"Nov 21 06:01:32 node3 bash[148776]: [148776,51,06:01:32.333,INF] Connection '""internal-normal""""""' [""10.0.2.16:41252"", {b82e349f-e6e2-4ff1-8a87-6983e0c00689}] closed: Success.",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop27,"Nov 21 06:01:32 node3 bash[148776]: [148776,62,06:01:32.333,INF] Connection '""internal-normal""""""' [""10.0.2.14:35436"", {fb24af04-d1d4-4b50-b28a-dbbbc724ed04}] closed: Success.",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop28,"Nov 21 06:01:32 node3 bash[148776]: [148776,30,06:01:32.333,INF] Connection '""internal-normal""""""' [""10.0.2.12:42312"", {5db8aebc-2fef-497d-86c7-0b09c5041e53}] closed: Success.",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2b,"Nov 21 06:01:32 node3 bash[148776]: [148776,107,06:01:32.333,INF] ES ""TcpConnection"" closed [06:01:32.333: N""10.0.2.15:43270"", L""10.0.2.13:1112"", {8835dd13-f844-41c0-9d5d-67d539335ac0}]:Receive calls: 1145705, callbacks: 1145705",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2c,"Nov 21 06:01:32 node3 bash[148776]: [148776,107,06:01:32.333,INF] ES ""TcpConnection"" closed [06:01:32.333: N""10.0.2.15:43270"", L""10.0.2.13:1112"", {8835dd13-f844-41c0-9d5d-67d539335ac0}]:Close reason: [Success] ""Socket closed""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2d,"Nov 21 06:01:32 node3 rsyslogd: action 'action-8-builtin:omfile' suspended (module 'builtin:omfile'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2112.0 try https://www.rsyslog.com/e/2007 ]",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2e,"Nov 21 06:01:32 node3 bash[148776]: [148776,107,06:01:32.333,INF] Connection '""internal-normal""""""' [""10.0.2.15:43270"", {8835dd13-f844-41c0-9d5d-67d539335ac0}] closed: Success.",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2f,"Nov 21 06:01:32 node3 bash[148776]: [148776,19,06:01:32.333,INF] Looks like node [""Unspecified/node4.esdb.prod.******.**********:1112""] is DEAD (TCP connection lost). Issuing a gossip to confirm.",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2g,"Nov 21 06:01:32 node3 bash[148776]: [148776,30,06:01:32.334,DBG] HTTP Request Received",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2h,Nov 21 06:01:32 node3 bash[148776]: 11/21/2024 06:01:32,node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2i,"Nov 21 06:01:32 node3 bash[148776]: From: ""10.0.1.8:54735""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2j,"Nov 21 06:01:32 node3 bash[148776]: ""GET"" http://node3.esdb.prod.******.**********:2113/projections/all-non-transient",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2k,"Nov 21 06:01:32 node3 bash[148776]: {[""Accept""]=[""application/json""], [""Connection""]=[""keep-alive""], [""Host""]=[""node3.esdb.prod.******.**********:2113""], [""User-Agent""]=[""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36""], [""Accept-Encoding""]=[""gzip, deflate""], [""Accept-Language""]=[""en,en-US;q=0.9,ja;q=0.8""], [""Referer""]=[""http://node3.esdb.prod.******.**********:2113/web/index.html""], [""ES-LongPoll""]=[""5""]}",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2l,"Nov 21 06:01:32 node3 bash[148776]: """"",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2m,"Nov 21 06:01:32 node3 bash[148776]: [148776,81,06:01:32.334,INF] ES ""TcpConnection"" closed [06:01:32.334: N""10.0.2.14:52410"", L""10.0.2.13:1112"", {c54ab3ce-ff59-43a8-9b14-472dcec5310a}]:Received bytes: 2197, Sent bytes: 0",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2q,"Nov 21 06:01:32 node3 bash[148776]: [148776,81,06:01:32.334,INF] Internal TCP connection accepted: [Normal, ""10.0.2.14:52410"", L""10.0.2.13:1112"", {c54ab3ce-ff59-43a8-9b14-472dcec5310a}].",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2r,"Nov 21 06:01:32 node3 bash[148776]: [148776,18,06:01:32.336,DBG] Closing connection '""internal-normal""""""' [""10.0.2.16:41252"", L""10.0.2.13:1112"", {b82e349f-e6e2-4ff1-8a87-6983e0c00689}] cleanly."" Reason: Closing replication subscription connection.""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2s,Nov 21 06:01:32 node3 rsyslogd: action 'action-8-builtin:omfile' resumed (module 'builtin:omfile') [v8.2112.0 try https://www.rsyslog.com/e/2359 ],node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2t,"Nov 21 06:01:32 node3 bash[148776]: [148776,18,06:01:32.336,DBG] Closing connection '""internal-normal""""""' [""10.0.2.12:42312"", L""10.0.2.13:1112"", {5db8aebc-2fef-497d-86c7-0b09c5041e53}] cleanly."" Reason: Closing replication subscription connection.""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2w,"Nov 21 06:01:32 node3 bash[148776]: [148776,19,06:01:32.336,INF] Looks like node [""Unspecified/node2.esdb.prod.******.**********:1112""] is DEAD (TCP connection lost). Issuing a gossip to confirm.",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2x,"Nov 21 06:01:32 node3 bash[148776]: [148776,19,06:01:32.336,INF] Looks like node [""Unspecified/node5.esdb.prod.******.**********:1112""] is DEAD (TCP connection lost). Issuing a gossip to confirm.",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop2y,"Nov 21 06:01:32 node3 bash[148776]: [148776,19,06:01:32.336,INF] Looks like node [""Unspecified/node1.esdb.prod.******.**********:1112""] is DEAD (TCP connection lost). Issuing a gossip to confirm.",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop5u,"Nov 21 06:01:32 node3 bash[148776]: [148776, 8,06:01:32.514,DBG] SLOW BUS MSG [""bus""]: ""CommittedEventDistributed"" - 57ms. Handler: ""EventReaderCoreService"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop5v,"Nov 21 06:01:32 node3 bash[148776]: [148776,26,06:01:32.514,DBG] SLOW BUS MSG [""bus""]: ""CommittedEventDistributed"" - 57ms. Handler: ""EventReaderCoreService"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop5w,"Nov 21 06:01:32 node3 bash[148776]: [148776,91,06:01:32.515,DBG] SLOW BUS MSG [""PersistentSubscriptionsBus""]: ""UnsubscribeFromStream"" - 111ms. Handler: ""PersistentSubscriptionService`1"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop5x,"Nov 21 06:01:32 node3 bash[148776]: [148776,40,06:01:32.515,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 60ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop5y,"Nov 21 06:01:32 node3 bash[148776]: [148776,46,06:01:32.516,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 60ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop5z,"Nov 21 06:01:32 node3 bash[148776]: [148776,29,06:01:32.517,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 60ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop60,"Nov 21 06:01:32 node3 bash[148776]: [148776,91,06:01:32.519,DBG] Retrying message ""$all::******-subscriber-group"" ""**************e2eResult-01932dd6-bea4-753f-843c-aa5a3413bf3b""/1 (stream position=""C:104504975584/P:104504975584"")",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop61,"Nov 21 06:01:32 node3 bash[148776]: [148776,91,06:01:32.519,DBG] Retrying message ""$all::******-subscriber-group"" ""**************e2eResult-01932dd6-bea4-753f-843c-aa5a3413bf3b""/1 (stream position=""C:104504975584/P:104504975584"")",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop62,"Nov 21 06:01:32 node3 bash[148776]: [148776,91,06:01:32.521,DBG] New connection to persistent subscription ""$all::******-subscriber-group"" by e56db625-8041-4de5-a3e4-713498783cda",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop63,"Nov 21 06:01:32 node3 bash[148776]: [148776,91,06:01:32.572,DBG] SLOW BUS MSG [""PersistentSubscriptionsBus""]: ""ConnectToPersistentSubscriptionToAll"" - 53ms. Handler: ""PersistentSubscriptionService`1"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop64,"Nov 21 06:01:32 node3 bash[148776]: [148776,91,06:01:32.572,DBG] New connection to persistent subscription ""$all::******-subscriber-group"" by 7c11437b-5e8e-44ef-8b44-ec9ea92e02b0",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop65,"Nov 21 06:01:32 node3 bash[148776]: [148776,91,06:01:32.573,DBG] New connection to persistent subscription ""$all::******-subscriber-group"" by 0e0dcc89-0516-4e43-8469-b22f8e7ed5a5",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop66,"Nov 21 06:01:32 node3 bash[148776]: [148776,71,06:01:32.605,DBG] Read Stream Events Backward operation has expired for Stream: ""**************e2eResult-01932dd6-edec-789a-b578-4926b1739c7e"", From Event Number: -1, Max Count: 1. Operation Expired at 11/21/2024 06:01:11",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop67,"Nov 21 06:01:32 node3 bash[148776]: [148776,71,06:01:32.605,DBG] Read All Stream Events Forward operation has expired for C:104501508953/P:0. Operation Expired at 11/21/2024 06:01:11",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop68,"Nov 21 06:01:32 node3 bash[148776]: [148776,71,06:01:32.605,DBG] Read Stream Events Backward operation has expired for Stream: ""**************e2eResult-01932dd6-edec-75dc-b02c-67a083fbafe9"", From Event Number: -1, Max Count: 1. Operation Expired at 11/21/2024 06:01:11",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop69,"Nov 21 06:01:32 node3 bash[148776]: [148776,71,06:01:32.605,DBG] Read Stream Events Backward operation has expired for Stream: ""**************e2eResult-01932dd6-ee14-7792-afd8-03f36b82f9b9"", From Event Number: -1, Max Count: 1. Operation Expired at 11/21/2024 06:01:11",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop6a,"Nov 21 06:01:32 node3 bash[148776]: [148776,71,06:01:32.605,DBG] Read All Stream Events Forward operation has expired for C:104502608695/P:0. Operation Expired at 11/21/2024 06:01:11",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop6b,"Nov 21 06:01:32 node3 bash[148776]: [148776,71,06:01:32.605,DBG] Read All Stream Events Forward operation has expired for C:104500633199/P:0. Operation Expired at 11/21/2024 06:01:11",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dz8r0af7kop6c,"Nov 21 06:01:32 node3 bash[148776]: [148776,71,06:01:32.605,DBG] Read All Stream Events Forward operation has expired for C:104503603956/P:0. Operation Expired at 11/21/2024 06:01:11",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:33.069564917Z
1dzkesbf7l1prt,"Nov 21 06:01:00 node3 bash[148776]: [148776,37,06:01:00.915,DBG] SLOW BUS MSG [""bus""]: ""UnwrapEnvelopeMessage"" - 49ms. Handler: ""UnwrapEnvelopeHandler"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:01.054265811Z
1dzpwzcf7klzp6,"Nov 21 06:01:34 node3 bash[148776]: [148776,37,06:01:34.079,DBG] SLOW BUS MSG [""bus""]: ""CommittedEventDistributed"" - 77ms. Handler: ""EventReaderCoreService"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:35.065485288Z
1dzpwzcf7klzp7,"Nov 21 06:01:34 node3 bash[148776]: [148776,28,06:01:34.080,DBG] PROJECTIONS: Stopping Projections Core Coordinator",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:35.065485288Z
1dzpwzcf7klzp8,"Nov 21 06:01:34 node3 bash[148776]: [148776,28,06:01:34.080,DBG] SLOW BUS MSG [""manager input bus""]: ""StopComponents"" - 79ms. Handler: ""ProjectionCoreCoordinator"".",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:35.065485288Z
1dzpwzcf7klzp9,"Nov 21 06:01:34 node3 bash[148776]: [148776,28,06:01:34.080,DBG] PROJECTIONS: Stopping Projections Manager. Correlation 9c1badae-8117-4149-a2e3-d5c189111b28",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:35.065485288Z
1dzpwzcf7klzpa,"Nov 21 06:01:34 node3 bash[148776]: [148776,19,06:01:34.083,INF] Offline truncation will happen, shutting down ""ReplicationTrackingService""",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:35.065485288Z
1dzpwzcf7klzpb,"Nov 21 06:01:34 node3 bash[148776]: [148776,15,06:01:34.083,INF] === SUBSCRIBED to [""Unspecified/node1.esdb.prod.******.**********:1112"",{7de5dff4-4b41-43d2-bd94-a85fe708535d}] at 104505008225 (0x1854FBE061). SubscriptionId: {f7ce7a11-ccad-4534-b741-31e46ccaa63f}.",node3,projects/******-****/logs/syslog,,2024-11-21T06:01:35.065485288Z

And here are the monitoring dashboards during the failures happening:


Some help from the community would be much appreciated!

Those very slow Bus Msg seems to indicate Disk saturation :
either IOPS or throughput .
I’d recommend looking at the Queue Disk Legnth of disk and the specific threshold exceed metrics in the cloud you use ( e.G on AWS

Thank you @yves.lorphelin !

Yes, we are probably on the same page on that (or I might have misunderstood what you wrote?)

Our devops team member came to the conclusion that the disk queue depth is too low (it hardly reaches 2, which roughly translates to 250-300 IOPS). So even with 500RPS of events appended by the golang client and 130K event/sec processed items reported by ESDB in the logs, the disk is only operating with a snail’s pace :snail: (in theory the disks in the cluster are able to handle 30000 IOPS)

We increased the ESDB read thread count in the config to be double of the CPU core count. But the same story: the leader node still runs out of memory.

Is there anything else in ESDB configs that we can adjust to increase the IOPS and the disk queue length? Or is there any known operating system level (on Linux) setting that is known to cause issues?

a high disk queue length is generally not good :
it means the processes have to wait before their write / read operations are executed

I’m not too familiar with GCP ,I’d take a look at those metrics and compare to what the provisioned disk offer.

In term of DISK IOPS/througput , I’ve seen ESDB depleting them happily under heavy loads .
(this means that the server actually is limited by the disk limits in those case)

  • deadlines, mem usage .
    a few possibilities:
  • you have 5 nodes: that requires that every write needs to be acked by at least 2 other nodes , that adds latency to reply to the client.
  • the deadline on the client:you might need to configure the default deadline due to the latency explained above & eventually also depending on where the client process is ( is it in the same region , same private network , anything between the client and the server that could add latency ?)

you mention a custom projection :

  • that require CPU ( on the leader node , that is running the projection) ,should be OK given the graphs you showed.
  • and memory: highly depends on what your projection is doing: if it has a state , that state needs to be loaded in memory . High mem usage I’ve seen in those cases have been tracked down to projection states that were not bounded ( i.e data is added to the state , but never removed)

@yves.lorphelin I am also on Gergely’s team. I would like to clarify about this point:

Our projection does store state. That should be around 500 bytes per stream and let’s say around 13
million streams. So we’re looking at around 6.5gb of state. I think currently we set up our instances to have 64gb of memory and we’re still seeing the memory climb. To the best of my knowledge, we are not appending to state, only overwriting it. Here is roughly what our projection is doing:

fromCategory("{{.inputCategoryName}}")
	.foreachStream()
	.when({
		$init: () => ({
			data: {
				some_field_one: 0,
				some_field_two: 0,
				some_field_three: 0,
				some_field_four: 0,
				some_field_five: "inactive",
				some_field_six: null,
				some_field_seven: null,
			},
		}),
		$any: (state, event) => {
                   // Do some calculations based on event data (basic switch statement, some additions/subtractions)
                   // state = new_state
	             emit(
				`{{.resultCategoryName}}-${state.data.some_field_seven}`,
				event.eventType,
				output_event,
				eventMetadata,
			);
// closing parens

Additionally we have 4/5 default projections disabled. Does the above throw any red flags? We have 1 input stream and 1 output stream per account and it’s basically just calculate new state, replace old state with new state, emit output event.

@sean.payne : no red flag on the above script.

The leader using more memory than the rest is expected behaviour.

Parameter you could tweak:

index-cache-size
This is the only one that will take into account the amount of memory

prepare-and-commit-timeouts , especially since you have a 5 nodes cluster

Can you tell why you have a 5 node cluster ?
That will increase the load on the leader as well as add latency in appending events.

@yves.lorphelin Thank you for the check.

It seems our SRE downgraded to a 3 node cluster on Thursday, so that should be okay.

I noticed that all of the projections turned back on. Currently we just have our application service turn 4 of them off on boot, but I suppose they can turn back on during leader re-election etc. We want one of these to be running, so I can’t completely disable the default projections.

How do you normally deal with this type of scenario? External process that periodically sends disables or you can set up some sort of trigger when a restart occurs?

@yves.lorphelin I think the default projections may have been turned on during a large data import. Is it possible that this would heavily impact our memory usage? I understand there is write amplification, but how much does this bleed over into memory usage?

Update: We can attribute high disk usage to log retention. Reducing this to 2 days immediately plummeted disk usage.

If they were disable they should still be disable after a reboot ( event of each node)
Maybe some configuration change / or the SRE that manually started them ?

the default projections should have low impact
though,

  • since more message are being written some mem usage is to account for
  • they need to catch-up from wherever they were previously stopped , that could impact some of the internal caches

Ah good find !

don’t forget to eventually copy day -1 log files somewhere in case you need them .