We have a system setup where we regularly experience contagious domino effects triggered by heartbeat timeouts.
We are very interested in modifying our architecture to reduce or remove this issue.
(We are also curious whether it would make sense to adjust the eventstore-client reconnection strategy dynamically,
so that, instead of doing 10 reconnects at 100 ms intervals, it would e.g. double the reconnect delay (100 ms, 200 ms, 400 ms, up to e.g. 3200 ms)
to dampen the vicious circle. Or whether this is crazy talk, in the context of eventstore connections.)
The central malign effect is:
- System load triggers heartbeat timeouts.
- Reactions to the heartbeat timeouts increase system load.
- Thus the problem spreads: other systems that initially had no problems now also get more heartbeat timeouts, causing a vicious circle.
- The vicious circle is fuelled by the fact that heartbeat-dropped connections trigger re-reads of the last ~50,000 events since the latest savepoint of our read-projection databases (see the sketch below).
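(To make the re-read concrete: the drop effectively restarts a catch-up subscription from the last saved projection position. A minimal sketch of what ours roughly looks like, assuming the .NET TCP client (EventStore.ClientAPI); the stream name and the checkpoint helper are made up, and exact parameter types vary a bit between client versions:)

```csharp
using EventStore.ClientAPI;

class ProjectionSubscriber
{
    // Hypothetical: reads the last processed event number from our projection database.
    private long? ReadCheckpointFromProjectionDb() => null;

    public void Resubscribe(IEventStoreConnection conn)
    {
        // After a drop we resubscribe from the last savepoint; everything between
        // the checkpoint and the stream head (the ~50,000 events) is re-read here.
        long? lastCheckpoint = ReadCheckpointFromProjectionDb();

        conn.SubscribeToStreamFrom(
            "customer-123-stream",               // hypothetical stream name
            lastCheckpoint,                      // null = read the stream from the beginning
            CatchUpSubscriptionSettings.Default,
            eventAppeared: (sub, resolvedEvent) =>
            {
                // Update the projection, then persist the new checkpoint.
            },
            liveProcessingStarted: sub =>
            {
                // Caught up to the head of the stream; now receiving live events.
            },
            subscriptionDropped: (sub, reason, ex) =>
            {
                // This is where the vicious circle starts: the drop handler
                // typically resubscribes, which triggers another bulk re-read.
            });
    }
}
```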
Our systems are spread out across several servers on a LAN -
3 servers host 5 eventstore instances each (single nodes, no clustering).
5 servers host the hundreds of application systems (think: one customer per app).
All are distributed/connected randomly and evenly (the apps on each app server will, between them, access all 15 eventstore instances).
A given app only uses one of the 15 eventstore instances.
The patterns for connection drops have these obvious(?) characteristics:
- If one of the 3 eventstore servers comes under load, the eventstore instances on that server, and the subscriber apps to those instances, will suffer.
- If one of the 5 eventstore instances on an eventstore server comes under load, its subscribers will suffer.
- If one of the app servers comes under load, the apps (their subscriptions) on that server will suffer.
The salient point about the ‘load cause’ is that systems unfortunately don’t fail in isolation;
instead, clusters of them fail at the same time :-/. This greatly worsens the issue.
Now, the concrete tech with which we cause these problems:
The hundreds of apps run as MS-IIS sites, typically in app pool sizes of about 8.
When load (or whatever) triggers heartbeat timeouts, they eventually fail to reconnect, the connection is closed, the IIS app pool ends up restarting,
and the restart triggers a massive request for “latest events since last projection-save”.
It’s worsened by the fact that several sites do this at the same time, often causing the problem to spread and persist
(“they keep infecting each other”).
In the eventstore statistics, which I manually monitor with elasticsearch and kibana,
I can see that the ‘pending-send’ size appears and persists for long periods / for the duration of the breakdown.
(In times of calm, ‘pending-send’ is silent/ignorable.)
The pending-send suggests to me that our client apps are ‘ordering’ more data than they can manage to process/offload in the short term,
and/or more data than the eventstore has the means to ship (which makes sense if lots of sites all request lots of data at the same time).
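(A related, hedged thought: if the over-ordering really comes from the catch-up reads, the CatchUpSubscriptionSettings used in the subscription sketch above appear to be the knob for how much a client asks for per read; something like the following, where the numbers are pure guesses rather than recommendations:)

```csharp
using EventStore.ClientAPI;

// Smaller read batches and a smaller live-event queue should make each client
// "order" less data per round-trip while catching up; the defaults are larger.
var throttledSettings = new CatchUpSubscriptionSettings(
    maxLiveQueueSize: 1000,
    readBatchSize: 100,
    verboseLogging: false,
    resolveLinkTos: false);

// Pass throttledSettings instead of CatchUpSubscriptionSettings.Default
// in the SubscribeToStreamFrom call from the earlier sketch.
```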
We also routinely see high cpu load / “100%” on the app servers during this,
as they struggle to process the pile of latest-event-data they just re-ordered,
which of course fuels the “heartbeat-timeout” fire.
To stop these breakdowns, we will often simply shut down lots of app pools and then restart them sequentially.
(First we start 8 sites; after 2-3 minutes, when cpu/network load lightens up, we start the next 8 sites, then wait 2 minutes,
and so on, until all 80-100 sites are running again.)
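(One idea we are toying with to automate that staggering, purely a sketch with made-up numbers: give each site a random startup delay before it opens its eventstore connection and starts catching up, so co-hosted sites don’t all re-order data in the same instant:)

```csharp
using System;
using System.Threading.Tasks;

static class StaggeredStartup
{
    private static readonly Random Rng = new Random();

    // Called once per site before connecting/subscribing to the eventstore.
    // The 0-120 s window is a guess; the point is just that sites recycled
    // together don't all start their catch-up reads in the same second.
    public static async Task DelayBeforeConnectingAsync()
    {
        int delaySeconds;
        lock (Rng) { delaySeconds = Rng.Next(0, 120); }
        await Task.Delay(TimeSpan.FromSeconds(delaySeconds));
    }
}
```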
We can also see varying patterns of ‘partial breakdown’, e.g. apps on one app server will get into a vicious cycle,
but the 4 other app servers continue to operate normally,
and also “all sites on eventstore-server 1 and 2 have problems, but all on eventstore-server 3 continue to operate normally.”
This suggests there is a bottleneck/flooding/high-watermark issue, where the system fails if load rises over a certain threshold,
but manages to function OK as long as we remain below it. (I said ‘function’, I didn’t say ‘robust’.)
About ways to fix this:
Because of the negative side effects of wanting to re-read lots of latest events,
and the negative effects of ‘wanting to do this at the same time as others’,
my thoughts and focus linger on trying to stop the dropped connections.
I.e. I’ll ‘tolerate’ the heartbeat timeouts, but I want to (?) stop dropping the connection after 10 retries.
Also, the reconnect policy is currently the 10 x 100 ms default, as far as I know.
This suggests to me that the current system strategy effectively says "we try to reconnect within one second; if it's not solved in 1 second, we give up
and cause a bigger problem".
My thought on this is “could we instead use an exponential wait”, ie
“try in 100 ms, try in 200 ms, try in 400 ms, try in 800 ms, try in 1600 ms, try in 3200 ms, give up”
(once I get above e.g. 3 seconds, I won’t continue into 8/16/32 waits, as that would just turn the systems into zombies?)
Basically I’m considering a ‘turtling’ approach, where the systems react in a ‘calming’ way under load, instead of exploding…
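(To make the ‘turtle’ idea concrete: as far as I can tell, the .NET ClientAPI connection-settings builder only exposes a fixed reconnection delay and a retry limit, not an exponential one, so the exponential part would presumably have to live in our own Closed handler. A rough sketch, assuming EventStore.ClientAPI and a single-node TCP endpoint; the delays, caps and heartbeat values are just the guesses from above:)

```csharp
using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;

static class TurtleConnection
{
    public static async Task<IEventStoreConnection> ConnectAsync(Uri tcpEndpoint)
    {
        var settings = ConnectionSettings.Create()
            // Be more tolerant of slow heartbeats instead of dropping quickly
            // (values are guesses, not recommendations).
            .SetHeartbeatInterval(TimeSpan.FromSeconds(2))
            .SetHeartbeatTimeout(TimeSpan.FromSeconds(5))
            // Replace the default "10 x 100 ms then give up" with fewer built-in
            // retries; the exponential waiting happens below, in our own handler.
            .LimitReconnectionsTo(3)
            .SetReconnectionDelayTo(TimeSpan.FromMilliseconds(500))
            .Build();

        var connection = EventStoreConnection.Create(settings, tcpEndpoint, "turtle-connection");

        var attempt = 0;
        connection.Closed += async (sender, args) =>
        {
            // Built-in retries are exhausted and the connection is gone.
            // Back off exponentially (100, 200, 400 ... capped at 3200 ms)
            // before reconnecting, instead of hammering the store.
            attempt++;
            var delayMs = Math.Min(100 * (int)Math.Pow(2, attempt - 1), 3200);
            await Task.Delay(delayMs);
            // Recreate the connection here: ConnectionSettings is immutable,
            // so a new connection object is needed to change any settings.
        };

        await connection.ConnectAsync();
        return connection;
    }
}
```

(If that is roughly right, it would also partly answer our own sub-question below: with the settings class being readonly, changing the delay would mean creating a new connection with new settings rather than tweaking the live one.)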
A sub-question to this: is it possible to modify any of the eventstore-client connection settings at runtime,
on an open connection (so far, I have only seen that the connection-settings class is readonly)?
In particular, I’m curious about adjusting the reconnect delay without closing my connection.
If we go with the ‘turtle’ approach, we’ll probably have to combine it with an ‘http 503’ filter on the IIS sites,
so they refuse to process HTTP requests as long as we don’t have a live reconnected eventstore-connection. (?)
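(For the 503 filter, something along these lines is what we have in mind; assuming ASP.NET MVC (System.Web.Mvc), with EventStoreHealth being a hypothetical static flag our connection event handlers would flip:)

```csharp
using System.Web.Mvc;

// Hypothetical flag, flipped from the eventstore connection's
// Connected / Disconnected / Closed event handlers.
public static class EventStoreHealth
{
    public static volatile bool Connected;
}

// Registered as a global filter; short-circuits requests with 503 while the
// eventstore connection is down, instead of letting work pile up.
public class RequireEventStoreConnectionAttribute : ActionFilterAttribute
{
    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        if (!EventStoreHealth.Connected)
        {
            filterContext.HttpContext.Response.AddHeader("Retry-After", "30");
            filterContext.Result = new HttpStatusCodeResult(503, "EventStore connection unavailable");
            return;
        }
        base.OnActionExecuting(filterContext);
    }
}
```

(Presumably registered via GlobalFilters.Filters.Add(...) in Application_Start.)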
Can anybody with more eventstore experience/overview than us tell us, from the above, ‘what we are basically doing wrong’ / what our worst sins are / what we should address to clean this up?
A thought on the turtling strategy: Given that we see the ‘pending send’ bottlenecks,
I’m wondering whether turtling will actually solve the issue, or just ‘push it a few feet further down the hall and be stuck there’.
On the other hand, if the pending-send bottlenecks are actually caused by the massive reconnects,
then limiting the reconnects may be all that is needed.
We also think a major cause of our issues is our hosting of the read-projections inside IIS;
in particular, older IIS was NOT designed for this (later versions of IIS have some provisions for 'background services', though we are not using them currently).
So we are thinking about moving the projections outside of IIS.
This, however, would introduce its own issues, since being inside IIS means the projections are ‘where we need them’.