Losing connections from CatchUpSubscriptions with HEARTBEAT TIMEOUT

We see a lot of the following log messages in the EventStore log, and we are struggling to keep our event handlers running.
All read-model event processing, except for the initial JSON deserialization, is off-loaded to worker threads (one thread per $by_category subscription).

We use one EventStore connection for all the event handlers, and the connection is set up with KeepReconnecting and KeepRetrying.
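
For reference, the setup looks roughly like this (a simplified sketch against the 3.x EventStore.ClientAPI; the endpoint, stream name and handler bodies are placeholders, not our real code):

```csharp
// Simplified sketch of the setup described above (EventStore.ClientAPI 3.x).
// Endpoint, stream name and handler bodies are placeholders.
using System;
using System.Net;
using EventStore.ClientAPI;

class ReadModelHost
{
    static void Main()
    {
        var settings = ConnectionSettings.Create()
            .KeepReconnecting()   // keep trying to reconnect, forever
            .KeepRetrying();      // keep retrying failed operations, forever

        var connection = EventStoreConnection.Create(
            settings, new IPEndPoint(IPAddress.Loopback, 1113));
        connection.ConnectAsync().Wait();

        // One catch-up subscription per category stream, e.g. $ce-Order.
        connection.SubscribeToStreamFrom(
            "$ce-Order",
            null,   // lastCheckpoint: start from the beginning on first run
            true,   // resolveLinkTos: $ce-* streams contain link events
            (sub, resolvedEvent) =>
            {
                // Deserialize the JSON here, then hand the event off to the
                // worker thread that owns this subscription's read model.
            },
            null,   // liveProcessingStarted
            (sub, reason, ex) => Console.WriteLine("Subscription dropped: " + reason));

        Console.ReadLine();
    }
}
```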

Any ideas on what we might be doing wrong?

/Thomas

[PID:12328:016 2015.10.29 13:35:09.748 TRACE InMemoryBus ] SLOW BUS MSG [bus]: CommittedEventDistributed - 58ms. Handler: EventReaderCoreService.

[PID:12328:016 2015.10.29 13:35:09.748 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projection Core #0]: CommittedEventDistributed - 58ms. Q: 182/182.

[PID:12328:018 2015.10.29 13:35:09.748 TRACE InMemoryBus ] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 62ms. Handler: UnwrapEnvelopeHandler.

[PID:12328:018 2015.10.29 13:35:09.748 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projection Core #2]: UnwrapEnvelopeMessage - 62ms. Q: 0/251.

[PID:12328:017 2015.10.29 13:35:09.748 TRACE InMemoryBus ] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 62ms. Handler: UnwrapEnvelopeHandler.

[PID:12328:017 2015.10.29 13:35:09.748 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projection Core #1]: UnwrapEnvelopeMessage - 62ms. Q: 0/251.

[PID:12328:016 2015.10.29 13:35:09.835 TRACE CoreProjectionCheckp] Writing checkpoint for $by_category at C:36027144/P:36027144 with expected version number 17

[PID:12328:016 2015.10.29 13:35:09.835 TRACE CoreProjectionCheckp] Checkpoint has been written for projection $by_category at sequence number 18 (current)

[PID:12328:016 2015.10.29 13:35:10.239 TRACE CoreProjectionCheckp] Writing checkpoint for $by_category at C:37475148/P:37475148 with expected version number 18

[PID:12328:016 2015.10.29 13:35:10.266 TRACE CoreProjectionCheckp] Checkpoint has been written for projection $by_category at sequence number 19 (current)

[PID:12328:016 2015.10.29 13:35:10.324 TRACE InMemoryBus ] SLOW BUS MSG [bus]: CoreTick - 53ms. Handler: ProjectionCoreService.

[PID:12328:016 2015.10.29 13:35:10.324 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projection Core #0]: CoreTick - 53ms. Q: 0/1.

[PID:12328:017 2015.10.29 13:35:10.625 TRACE InMemoryBus ] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 60ms. Handler: UnwrapEnvelopeHandler.

[PID:12328:017 2015.10.29 13:35:10.625 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projection Core #1]: UnwrapEnvelopeMessage - 60ms. Q: 0/251.

[PID:12328:018 2015.10.29 13:35:10.625 TRACE InMemoryBus ] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 60ms. Handler: UnwrapEnvelopeHandler.

[PID:12328:018 2015.10.29 13:35:10.625 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projection Core #2]: UnwrapEnvelopeMessage - 60ms. Q: 0/251.

[PID:12328:016 2015.10.29 13:35:10.625 TRACE InMemoryBus ] SLOW BUS MSG [bus]: WriteEventsCompleted - 66ms. Handler: RequestResponseDispatcher`2.

[PID:12328:016 2015.10.29 13:35:10.625 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projection Core #0]: WriteEventsCompleted - 66ms. Q: 0/3.

[PID:12328:016 2015.10.29 13:35:10.768 TRACE CoreProjectionCheckp] Writing checkpoint for $by_category at C:38923845/P:38923845 with expected version number 19

What are your timeouts set for? There are two settings, one on the connection and one on the server.

What size messages are you pushing?

What is the batch size on the subscription?

What kind of network are you running on?

What's the ping rate between the machine opening the subscriptions and
the server?

Greg

<What are your timeouts set for?>
Both are default ES 3.3 settings

<What size messages are you pushing?>

Most are 1 KB or less. Some are 150 KB.

<What is the batch size on the subscription?>

The ES 3.3 default, which is 500.

<What kind of network are you running on?>

Single-machine developer PC. We are not live yet :)

We are running all our services + ES + MongoDB on the same machine. This will not be the final setup, but we should be able to run a stable, if less responsive, setup on a single machine, I guess.

<What's the ping rate between the machine opening the subscriptions and
the server?>

localhost <1ms

/Thomas

500 * 150 KB = 75 MB

Default timeouts:

1.5 seconds / 1 second client
700ms server

It's not surprising you are not getting a 75 MB batch across within a
timeout period.

Lower batch size or raise timeouts.
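
As a rough sketch of both knobs (EventStore.ClientAPI 3.x; the numbers are illustrative and the exact signatures and server options can vary between versions):

```csharp
// Illustrative sketch: raise the client heartbeat timeout and lower the catch-up
// read batch size. The numbers are examples, not recommendations.
using System;
using System.Net;
using EventStore.ClientAPI;

class TuningSketch
{
    static void Main()
    {
        var settings = ConnectionSettings.Create()
            .KeepReconnecting()
            .KeepRetrying()
            .SetHeartbeatInterval(TimeSpan.FromMilliseconds(750))
            .SetHeartbeatTimeout(TimeSpan.FromSeconds(5));   // raised from the ~1.5 s default

        var connection = EventStoreConnection.Create(
            settings, new IPEndPoint(IPAddress.Loopback, 1113));
        connection.ConnectAsync().Wait();

        connection.SubscribeToStreamFrom(
            "$ce-Order",
            null,                                   // lastCheckpoint
            true,                                   // resolveLinkTos
            (sub, evt) => { /* handle event */ },
            null,                                   // liveProcessingStarted
            (sub, reason, ex) => Console.WriteLine("Dropped: " + reason),
            null,                                   // userCredentials
            100);                                   // readBatchSize, down from the default of 500

        // The server-side heartbeat timeout is a separate setting on the EventStore server
        // (e.g. the ext-tcp-heartbeat-timeout option; check the docs for your version).
    }
}
```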

Also, 150 KB for an event is huge. Are you putting image data etc. in them?

Regarding your worker-thread offloading: we've had problems in the past with heartbeat timeouts when something else in the process has been using the thread pool heavily. Watch out for that.

Thanks. We will try that.

<Are you putting image data etc. in them?>
No, we have an array of coherent observations in one event, mostly because this is how we receive the information from the market.

In many cases we should be able to split the data into multiple events without ruining the domain language.
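
For illustration only, with made-up type and field names (not our real domain model), the split would look something like this:

```csharp
// Made-up types illustrating the split: one large "array of observations" event
// versus one small event per observation.
using System.Collections.Generic;
using System.Linq;

public class Observation
{
    public string Instrument { get; set; }
    public decimal Value { get; set; }
}

// Today: one ~150 KB event carrying everything we receive from the market in one go.
public class MarketObservationsReceived
{
    public List<Observation> Observations { get; set; }
}

// Possible split: many ~1 KB events, one per observation.
public class ObservationReceived
{
    public Observation Observation { get; set; }
}

public static class ObservationSplitter
{
    public static IEnumerable<ObservationReceived> Split(MarketObservationsReceived source)
    {
        return source.Observations.Select(o => new ObservationReceived { Observation = o });
    }
}
```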

/Thomas

<we’ve had problems in the past with heartbeat timeouts when something else in the process has been using the thread pool heavily>

How did you solve your issues?

We have many scenarios where our read model worker threads will have to compete with other tasks for the thread pool.

/Thomas

Honestly, I just upped the minimum number of threads the thread pool was allowed to keep around, and that worked. In that case we were using Akka.NET to read lots of streams concurrently, and we had lots of heartbeat timeouts.
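
Something along these lines (the numbers are illustrative; measure what your process actually needs):

```csharp
// Illustrative sketch: raise the thread pool minimums so the EventStore client's
// heartbeats and callbacks are not starved when the pool is saturated by other work.
using System;
using System.Threading;

class ThreadPoolTuning
{
    static void Main()
    {
        int workerThreads, ioThreads;
        ThreadPool.GetMinThreads(out workerThreads, out ioThreads);

        // The default minimum is roughly one thread per core; keeping more warm avoids
        // queuing behind the pool's slow thread-injection rate during bursts.
        ThreadPool.SetMinThreads(
            Math.Max(workerThreads, 64),
            Math.Max(ioThreads, 64));
    }
}
```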

cheers