heartbeat timeout crowd control

We have a system setup where we regularly experience contagious domino effects from heartbeat timeouts.
We are very interested in modifying our architecture to reduce or remove this issue.

(We are also curious whether it would make sense to adjust the eventstore-client reconnection strategy dynamically, so that instead of doing 10 reconnects at 100 ms each, it would e.g. double the reconnect delay (100 ms, 200 ms, 400 ms, up to e.g. 3200 ms) to dampen the vicious circle. Or whether this is crazy talk, in the context of eventstore connections.)

The central malign effect is:

  • System load triggers heartbeat timeouts.

  • Reactions to the heartbeat timeouts increase system load.

  • Thus the problem spreads: other systems, initially without problems, now also get more heartbeat timeouts, causing a vicious circle.

  • The vicious circle is fuelled by the fact that heartbeat-dropped connections trigger re-reads of the last ~50,000 events since the latest savepoint of our read-projection databases (a sketch of this catch-up pattern follows the list).
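For context, what we believe happens on each restart is a catch-up read from the last saved Position. A minimal sketch of that pattern against the 3.x ClientAPI (shown as a console program for simplicity; in our setup this lives inside IIS, and LoadCheckpoint/Project/SaveCheckpoint are hypothetical placeholders for our own code, not library calls):

```csharp
// Minimal sketch of the catch-up pattern described above, against the 3.x
// EventStore.ClientAPI (method shapes differ slightly in newer client versions).
using System;
using EventStore.ClientAPI;

class ProjectionHost
{
    static void Main()
    {
        var conn = EventStoreConnection.Create(new Uri("tcp://admin:changeit@es-server-1:1113"));
        conn.ConnectAsync().Wait();

        // The "savepoint": the $all Position we last persisted together with the projection state.
        Position? lastCheckpoint = LoadCheckpoint();

        conn.SubscribeToAllFrom(
            lastCheckpoint,
            false,                                   // resolveLinkTos
            (sub, evt) =>
            {
                Project(evt);                        // update the read model
                SaveCheckpoint(evt.OriginalPosition);
            },
            sub => Console.WriteLine("caught up, now live"),
            (sub, reason, ex) => Console.WriteLine("dropped: " + reason)); // the re-read storm starts when this fires and the app restarts

        Console.ReadLine();
    }

    static Position? LoadCheckpoint() { return null; }          // hypothetical placeholder
    static void Project(ResolvedEvent evt) { }                  // hypothetical placeholder
    static void SaveCheckpoint(Position? position) { }          // hypothetical placeholder
}
```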

Our systems are spread out across several servers on a LAN -

3 servers host 5 eventstore instances each (single nodes, no clustering).

5 servers host the hundreds of application systems (think: one customer per app).

And all are distributed/connected randomly and evenly (the apps on each app server access eventstores spread across all 15 instances).

A given app only uses one of the 15 eventstore instances.

The patterns for connection drops have these obvious(?) characteristics:

  • if one of the 3 eventstore servers comes under load, the eventstore instances on that server, and the subscriber apps to those instances, will suffer.

  • if one of the 5 eventstore-instances on an eventstore server comes under load, its subscribers will suffer.

  • if one of the app servers comes under load, the apps (their subscriptions) on that server will suffer.

The salient point from the ‘load cause’ is that systems unfortunately don’t fail in isolation; instead, clusters of stuff will fail at the same time :-/. This greatly worsens the issue.

Now, the concrete tech with which we cause these problems:

The hundreds of apps run as MS-IIS sites, typically in app pool sizes of about 8.

When load (or whatever) triggers heartbeat timeouts, the sites eventually fail to reconnect, the connection is closed, the IIS app pool ends up restarting, and the restart triggers a massive request for “latest events since last projection-save”. It’s worsened by the fact that several sites do this at the same time, often causing the problem to spread and persist (“they keep infecting each other”).

In the eventstore statistics, which I manually monitor with elasticsearch and kibana, I can see that the ‘pending send’ queue size grows and stays elevated for long periods / for the duration of the breakdown. (In times of calm, ‘pending send’ is silent/ignorable.)

The pending-send suggests to me that our client apps are ‘ordering’ more data than they can manage to process/offload in the short term, and/or more data than the eventstore has the means to ship (which makes sense if lots of sites all request lots of data at the same time).

We also routinely see high cpu load / “100%” on the app servers during this,

as they struggle to process the pile of latest-event-data they just re-ordered,

which of course fuels the “heartbeat-timeout” fire.

To stop these breakdowns, we will often simply shut down lots of app pools and then restart them sequentially (first we start 8 sites; after 2-3 minutes, when cpu/network load lightens up, we start the next 8 sites, then wait 2 minutes, and so on, until all 80-100 sites are running again).
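For what it’s worth, that manual routine could be scripted. A rough sketch using Microsoft.Web.Administration, where the batch size, wait time and the “customer-” pool-name filter are made-up examples, and the tool must run elevated on the app server:

```csharp
// Hypothetical staggered restart of IIS app pools, batch by batch,
// mirroring the manual "start 8, wait a few minutes, start the next 8" routine.
using System;
using System.Linq;
using System.Threading;
using Microsoft.Web.Administration;   // reference Microsoft.Web.Administration.dll

class StaggeredRestart
{
    static void Main()
    {
        const int batchSize = 8;                              // example value
        var waitBetweenBatches = TimeSpan.FromMinutes(2);     // example value

        using (var iis = new ServerManager())
        {
            var pools = iis.ApplicationPools
                           .Where(p => p.Name.StartsWith("customer-"))   // hypothetical naming convention
                           .ToList();

            // Stop everything first, then bring pools back in small batches
            // so the eventstore catch-up reads don't all hit at once.
            foreach (var pool in pools.Where(p => p.State == ObjectState.Started))
                pool.Stop();

            foreach (var batch in pools.Select((p, i) => new { p, i }).GroupBy(x => x.i / batchSize))
            {
                foreach (var item in batch)
                    item.p.Start();

                Console.WriteLine("Started batch {0}, waiting...", batch.Key);
                Thread.Sleep(waitBetweenBatches);
            }
        }
    }
}
```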

We can also see varying patterns of ‘partial breakdown’, e.g. the apps on one app server will get into a vicious cycle while the 4 other app servers continue to operate normally, and also “all sites on eventstore-server 1 and 2 have problems, but all on eventstore-server 3 continue to operate normally.”

This suggests there is a bottleneck/flooding/high-watermark issue, where the system fails if load goes over a certain threshold, but manages to function OK as long as we remain below it. (I said ‘function’, I didn’t say ‘robust’.)

About ways to fix this:

Because of the negative side effects of wanting to re-read lots of latest events, and the negative effects of ‘wanting to do this at the same time as others’, my thoughts and focus linger on trying to stop the dropped connections. I.e. I’ll ‘tolerate’ the heartbeat timeouts, but I want to (?) stop dropping the connection after 10 retries.

Also, the reconnect policy currently is the 10 x 100 ms default, as far as I know. This suggests to me that the current strategy effectively says "we try to reconnect within one second; if it isn’t solved in one second, we give up and cause a bigger problem".

My thought on this is “could we instead use an exponential wait”, i.e. “try in 100 ms, try in 200 ms, try in 400 ms, try in 800 ms, try in 1600 ms, try in 3200 ms, give up” (once I get above e.g. 3 seconds, I won’t continue into 8/16/32-second waits, as that would just turn the systems into zombies?)

Basically I’m considering a ‘turtling’ approach, where the systems react in a ‘calming’ way under load, instead of exploding…
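For reference, as far as we can tell the stock ClientAPI only exposes a fixed reconnection delay and a retry limit (or unlimited retries) via ConnectionSettings, so a true exponential backoff would need a client change; but even the existing knobs allow something gentler than 10 x 100 ms. A sketch of the settings we mean (address and numbers are examples, not our production values):

```csharp
// Sketch of the ConnectionSettings knobs in EventStore.ClientAPI that are
// relevant here. This gives "keep trying, with a longer fixed delay" rather
// than true exponential backoff.
using System;
using EventStore.ClientAPI;

class ConnectionFactory
{
    public static IEventStoreConnection Create()
    {
        var settings = ConnectionSettings.Create()
            .KeepReconnecting()                                  // instead of giving up after 10 attempts
            .SetReconnectionDelayTo(TimeSpan.FromSeconds(1))     // default is 100 ms
            .SetHeartbeatInterval(TimeSpan.FromSeconds(2))       // give a loaded box more slack
            .SetHeartbeatTimeout(TimeSpan.FromSeconds(5))
            .Build();

        var conn = EventStoreConnection.Create(settings, new Uri("tcp://admin:changeit@es-server-1:1113"));
        conn.ConnectAsync().Wait();
        return conn;
    }
}
```

Relaxing the heartbeat values is a trade-off, of course: longer timeouts mean genuinely dead connections are detected later.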

A sub-question to this: is it possible to modify any of the eventstore client connection settings at runtime, on an open connection (so far, I have only seen that the connection-settings class is read-only)? In particular, I’m curious about adjusting the reconnect delay without closing my connection.

If we go with the ‘turtle’ approach, we’ll probably have to combine it with an ‘HTTP 503’ filter on the IIS sites, so they refuse to process HTTP requests as long as we don’t have a live, reconnected eventstore connection. (?)
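What we imagine for that filter is something like the sketch below: a classic ASP.NET IHttpModule gating requests on a flag toggled by the ClientAPI’s Connected/Disconnected/Closed events. The module and the static EventStoreGateway holder are our own hypothetical wiring (not an existing component), and it would still need registering in web.config:

```csharp
// Hypothetical IHttpModule that returns 503 while the eventstore connection
// is down. EventStoreGateway is our own static holder, not a library type.
using System.Web;
using EventStore.ClientAPI;

public static class EventStoreGateway
{
    public static volatile bool Connected;

    public static void Track(IEventStoreConnection conn)
    {
        conn.Connected    += (s, e) => Connected = true;
        conn.Disconnected += (s, e) => Connected = false;
        conn.Closed       += (s, e) => Connected = false;
    }
}

public class RefuseWhileDisconnectedModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, args) =>
        {
            if (EventStoreGateway.Connected) return;

            var ctx = ((HttpApplication)sender).Context;
            ctx.Response.StatusCode = 503;              // Service Unavailable
            ctx.Response.AddHeader("Retry-After", "30");
            ctx.ApplicationInstance.CompleteRequest();  // short-circuit the pipeline
        };
    }

    public void Dispose() { }
}
```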

Can anybody with more eventstore experience/overview than us tell us, from the above, ‘what we are basically doing wrong’ / what our worst sins are / what we should address to clean this up?

A thought on the turtling strategy: given that we see the ‘pending send’ bottlenecks, I’m wondering whether turtling won’t solve the issue, but just ‘push it a few feet further down the hall and be stuck there’.

On the other hand, if the pending-send bottlenecks are actually caused by the massive reconnects,

then limiting the reconnects may be all that is needed.

We also think a major cause of our issues is our hosting the read-projections inside IIS; in particular, older IIS was NOT designed for this (later versions of IIS have some provisions for ‘background services’, though we are not using them currently). So we are thinking about moving the projections outside of IIS. This, however, would introduce its own issues, since being inside IIS means the projections are ‘where we need them’.

A few details I forgot:

  • “100% cpu” on the app servers is of course conducive to having the problems, but the problem can certainly persist even if the app servers are only at moderate load (however, in those cases we still see the ‘pending send’ queues on the eventstores, so ‘pending send’ seems to always trigger whenever we cause these problems).

  • we are on eventstore 3.9.4, I think. We are / have been about to switch to a newer version of ES, but it is not our impression that the problem goes away just by upgrading ES.

  • “100% cpu” does not mean we believe “well, it’s OK that the servers once in a while run at 100% cpu”; it mostly means “the problem is so bad we eventually max out at 100”.

I am reading this and am very confused. What disks are you running on?

Heartbeat timeouts are network based. Overall load is normally storage based.

We are running on “fast” SSD disks (in RAID, purely for safety).

Also, another detail: I’m pretty sure the initial announcer of the heartbeat timeout is the eventstore server, NOT the client (this may be ‘full moon has no clouds’ bias; maybe we don’t log client timeouts as meticulously…)

In terms of supporting exponential backoffs, it is pretty trivial to implement: https://github.com/EventStore/EventStore/blob/release-v4.1.1/src/EventStore.ClientAPI/Internal/EventStoreConnectionLogicHandler.cs#L157 I do not believe, however, that it will help in this case. My guess is something else is going on.
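(As far as we understand, the linked handler reconnects after the fixed ReconnectionDelay from the settings; an exponential variant would compute the delay from the attempt number instead, roughly along these lines. Illustration of the idea only, not actual EventStore code:)

```csharp
// Illustration only: an exponential backoff delay with a cap, the kind of
// calculation that could replace a fixed reconnection delay.
using System;

static class Backoff
{
    public static TimeSpan Delay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        // attempt 0 -> 100 ms, 1 -> 200 ms, 2 -> 400 ms, ... capped at maxDelay (e.g. 3200 ms).
        var factor = Math.Pow(2, Math.Min(attempt, 16));   // clamp the exponent to avoid silly values
        var ms = Math.Min(baseDelay.TotalMilliseconds * factor, maxDelay.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(ms);
    }
}
```

E.g. Delay(3, 100 ms, 3200 ms) gives 800 ms, and from attempt 5 onwards the delay stays pinned at 3200 ms.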

Are there any performance statistics that might throw further light on it?
We stare a lot at the hundreds of performance characteristics, but glean little from them

(we ship the statistics stream into the ELK stack.)

I am reading this and am very confused. What disks are you running on?
Heartbeat timeouts are network based.

Overall load is normally storage based.

My thoughts on this:

If one party - the app servers - has a heavy cpu load, I would assume this too could cause heartbeat timeouts, because the eventstore clients wouldn’t be scheduled often enough, with the cpu busy elsewhere.

One supporting indication for this is that it seems we are able to ‘force’ a breakdown by artificially putting cpu load on an app server (i.e. if I choose to run some random tool that hogs significant amounts of cpu, I can trigger a spike in the heartbeat timeouts).

But I know this is not the only thing, as I’ve seen breakdowns even with “moderate” (40-60%) cpu load on an app server (60% load is still high, IMO).

Regarding network:

What we consistently observe during breakdowns is that the eventstore servers “max out” on sent network traffic:

They continuously ship huge amounts of network traffic to the affected app server.

They then do this at a constant rate,

making me believe that we are, obviously, hitting the roof of whatever

the lowest bottleneck is (either how much the eventstore server can ship in its pipeline,

or how much the app servers are able to suck in/process.)

Note this is matched by the ‘pending send’ queues on the eventstores lighting up.

Our impression is that this network traffic is the “please send me the last 0-99999 lost events” requests from the continually recycling app servers.

A further detail from our architecture.

Our systems are mostly running on the ‘all’ stream, in 3.9.4. We would have preferred to run on system streams, but apparently a bug in 3.9.4 prohibited us from ever getting this to work robustly (the system projections would randomly die/stop without alerting us).

With the all-stream: as each eventstore will run anywhere from 20 to 80 customer systems, this means that every single eventstore client receives traffic intended for 30-ish systems, and in 29 out of 30 cases inspects the event and decides “well, this was never intended for me” (roughly the pattern sketched below).
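Roughly what that per-event inspection looks like on our side. The “customer-1234-” stream prefix is a made-up naming convention for the sketch (our real naming differs); the point is that every subscriber pays at least the inspection cost for every event on $all:

```csharp
// Illustration of the filter-and-discard cost on the all-stream.
using System;
using System.Text;
using EventStore.ClientAPI;

static class AllStreamFilter
{
    const string MyPrefix = "customer-1234-";   // hypothetical per-customer prefix

    public static void EventAppeared(EventStoreCatchUpSubscription sub, ResolvedEvent evt)
    {
        // 29 out of 30 events end here: inspected, then thrown away.
        if (!evt.OriginalStreamId.StartsWith(MyPrefix, StringComparison.Ordinal))
            return;

        // Only now do we pay for deserialization and projection work.
        var json = Encoding.UTF8.GetString(evt.Event.Data);
        // ... update the read projection from `json` ...
    }
}
```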

We are currently gradually switching to system-streams;

we are unsure to which extent this will alleviate our problems

(even on system streams, we still have the same architecture with

projection-subscriptions living inside IIS)

Also, again regarding disk activity: I’m not sure I see much disk activity anywhere in the system

(I see, once in a while, one of the eventstore instances doing huge reads,

but never anything resembling the huge network traffic we see.)

I’ll go hunt for disk activity when I get back to work tomorrow.

Once a second is not a stampede. There is something failing below. At 1/ms, sure, you get stampeded.

Here is a graph from one of the shorter ‘self-healing’ crashes: it shows an initial 30 seconds where the app servers struggle to even open a connection to the eventstore server. After that follow 5 minutes with swarms of heartbeat timeouts.

Some graphs/statistics here. The green graph is a tiny crash that self-corrected in 5 minutes, just before noon.
The blue graph is a two-hour crash, from 5:30 to 7:30 in the morning.

The blue graph shows 2 out of about 15 eventstores.

The dark blue eventstore participated in the first few minutes of the two-hour crash, but corrected rather quickly and braved the rest of it.

The light blue eventstore was not so lucky; it took part in the entire (?) two-hour party.

The green 5-minute crash is notable in being the same crash I showed a bit of log from earlier.

In the 5-minute crash, there was an initial 30 seconds where the eventstore clients reported their ‘close reason’ as “failed to establish a connection”.

For the remaining 4:30 minutes, this error disappeared to be replaced by “normal” heartbeat timeouts.

(I am aware these graphs are tiny and barely readable.

However, for the curious, it is actually possible to decipher the legend by zooming in.)

Also, a crucial tidbit of information I realise I left out (NOT the full cause):

Sometimes, we’ll post “huge” bursts of events, e.g. ~5000 events.

I’m not sure if this is misuse of the eventstore technology.

Misuse or not, what we observe is that these spikes of many events in quick succession invariably/very often trigger some of our system breakdowns.

It is not totally surprising to us, given that

we are using “all” streams instead of system streams(*2).

This means that these thousands of events must be ‘broadcast’ to between 30 and 80 app systems, which must deserialize the events just to discover that, 79 times out of 80, it was not for them.

However, we also see these crashes in cases where thousands of events weren’t created.

So even though we are working actively to avoid the ‘thousand-event’ bursts, and to switch to system-streams, we have the impression there is still a grave problem at the bottom of this that those things won’t get rid of.

Currently, I am trying to get a lot of runtime statistics from the app servers, to see if the sickness lies there.

(*2) (about one-fourth of our systems have completed the transition to system-streams,

but it is cumbersome for us to complete these conversions while still keeping our client systems online.)