Heartbeat timeouts

jarchin · November 19, 2015, 6:26pm

I have it set to 5000 ms on both int and ext., Interval is at 2000.

One of my processes always loses the connection, without Subscription dropped being called, nor OnError / OnClosed being raised by es-connection. And then it never reconnects.

The logs show this:

[PID:05396:008 2015.11.19 17:55:59.786 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:7293, L127.0.0.1:4112, {e615e0e0-c90e-4842-be2d-36de29331b3b}].

[PID:05396:016 2015.11.19 17:56:01.388 TRACE TwoPhaseRequestManag] IDEMPOTENT WRITE TO STREAM ClientCorrelationID a8a471b8-6c6b-495d-9b39-875ce154e265, EventStreamId: Cmds, CorrelationId: c654b43a-35a3-4ac6-b10e-a1484dc820f4, FirstEventNumber: 428, LastEventNumber: 428.

[PID:05396:016 2015.11.19 17:56:01.578 TRACE TwoPhaseRequestManag] IDEMPOTENT WRITE TO STREAM ClientCorrelationID 39f62f09-bb43-4d54-bbb0-8a3a830dd170, EventStreamId: Cmds, CorrelationId: 188487ef-6afe-4131-83ca-95a331ad1a0d, FirstEventNumber: 512, LastEventNumber: 512.

[PID:05396:016 2015.11.19 17:56:01.735 TRACE TwoPhaseRequestManag] IDEMPOTENT WRITE TO STREAM ClientCorrelationID 02f600d6-bce0-4237-91ca-cefb47db7861, EventStreamId: Cmds, CorrelationId: 565b6f8d-5aa9-406e-9bdd-d44314fa9243, FirstEventNumber: 513, LastEventNumber: 513.

[PID:05396:015 2015.11.19 17:56:08.790 TRACE TcpConnectionManager] Closing connection ‘external-normal’ [127.0.0.1:7293, L127.0.0.1:4112, {e615e0e0-c90e-4842-be2d-36de29331b3b}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 13

[PID:05396:015 2015.11.19 17:56:08.790 INFO TcpConnection ] ES TcpConnection closed [17:56:08.790: N127.0.0.1:7293, L127.0.0.1:4112, {e615e0e0-c90e-4842-be2d-36de29331b3b}]:

Received bytes: 2184, Sent bytes: 255818

Send calls: 14, callbacks: 14

Receive calls: 14, callbacks: 13

Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 13

[PID:05396:015 2015.11.19 17:56:08.790 INFO TcpConnectionManager] Connection ‘external-normal’ [127.0.0.1:7293, {e615e0e0-c90e-4842-be2d-36de29331b3b}] closed: Success.

[PID:05396:028 2015.11.19 17:56:08.790 DEBUG PersistentSubscripti] Lost connection from 127.0.0.1:7293

This is same wether debugging or not. I’m not putting breakpoints, (which always causes the connection to drop and then reconnect, when hit).

As far as I can tell there is no exception being thrown anywhere. Is it possible to guess at a reason for this timeout?

jarchin · November 19, 2015, 6:28pm

it’s 3.2.2

jarchin · November 21, 2015, 11:03pm

Heartbeat timeouts stopped when changing this:

In a RunAsync() method, changing from this

var res = await _handler.Handle(nextMsg);

couldSave = await _cmdService.Acknowledge(cmd, res);

``

to this

var res = _handler.Handle(nextMsg).Result; couldSave = Acknowledge(cmd, res).Result;

made the sudden heartbeat timeout stop. Awaiting any of those would lead to inevitable nonrecoverable drop.

Awaiting a task within a task has been mentioned in many places, also here in this group by João B, to be redundant.

Now, it would be interesting to know exactly why this caused the subscription to drop and not reconnect. Some sort of deadlock, but how so?

Anyhow, that’s that. Maybe saves some time for the guy to trip on it.

jarchin · November 22, 2015, 8:00pm

Huh… this Keeps biting me.
In a catch clause I have logging, writing to ES, so same thing happens there, and now I’m in this method:

var res = _handler.Handle(nextMsg).Result;

``

…so either, await, .Result or .Wait() on a method in any scope above the call to ES will result in that ES call deadlocking.

This doesn’t seem normal. Am I abusing the async/await pattern? If not, it really should not be this way with ES.

jarchin · November 22, 2015, 8:06pm

Good thing with the logging: I don’t need to await the result. So I just removed the .Result on the writer, it steps right out of the method, and then it doesn’t deadlock on the ES call.