Strange heartbeat timeouts

Turns out our heartbeat-related issue wasn’t related to System Endpoint Protection i posted about the other day. They’ve disabled network monitoring, same problem. To recap: 50% ram free, cpu at 10-20%, no disk queues.

I increased the hb-timeout by 10 (so now (10/20 sec), no difference.

Server log says:

[PID:03928:009 2016.07.15 07:34:28.108 TRACE TcpConnectionManager] Closing connection ‘external-normal’ [127.0.0.1:56451, L127.0.0.1:1113, {a8ccb238-239b-454d-87bf-44550030b630}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 50

[PID:03928:009 2016.07.15 07:34:28.108 INFO TcpConnection ] ES TcpConnection closed [07:34:28.108: N127.0.0.1:56451, L127.0.0.1:1113, {a8ccb238-239b-454d-87bf-44550030b630}]:Received bytes: 1655, Sent bytes: 170281

[PID:03928:009 2016.07.15 07:34:28.108 INFO TcpConnection ] ES TcpConnection closed [07:34:28.108: N127.0.0.1:56451, L127.0.0.1:1113, {a8ccb238-239b-454d-87bf-44550030b630}]:Send calls: 51, callbacks: 51

[PID:03928:009 2016.07.15 07:34:28.108 INFO TcpConnection ] ES TcpConnection closed [07:34:28.108: N127.0.0.1:56451, L127.0.0.1:1113, {a8ccb238-239b-454d-87bf-44550030b630}]:Receive calls: 51, callbacks: 50

[PID:03928:009 2016.07.15 07:34:28.108 INFO TcpConnection ] ES TcpConnection closed [07:34:28.108: N127.0.0.1:56451, L127.0.0.1:1113, {a8ccb238-239b-454d-87bf-44550030b630}]:Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 50

[PID:03928:009 2016.07.15 07:34:28.108 INFO TcpConnectionManager] Connection ‘external-normal’ [127.0.0.1:56451, {a8ccb238-239b-454d-87bf-44550030b630}] closed: Success.

The connections live for 1-10 minutes. Nothing else in the logs. (no slow messages).

Client log says:

ClientAPI TcpConnection closed [07:34:28.108: N127.0.0.1:1113, L127.0.0.1:56419, {8172663d-a994-4170-937c-fc3df71c7276}]: Received bytes: 199201, Sent bytes: 2119 Send calls: 92, callbacks: 92 Receive calls: 175, callbacks: 175 Close reason: [Success] Socket closed

Question: Is this server or client initiated close? How can I debug it further? (will enable client verbose logging)

/Peter

There are timeouts set by client as well. What are you setting them to?

They’re at 10/20 sec now as well. Saw no difference in behaviour at all.

/Peter

All that happens with timeouts is:

If no message is sent during a a duration a heartbeat is sent.
If that heartbeat does not get a response within timeout close connection.

Usually the cause is a retry etc over tcp or a broken socket.

This is my client config:

.SetHeartbeatInterval(TimeSpan.FromSeconds(10))
.SetHeartbeatTimeout(TimeSpan.FromSeconds(20))

But HeartbeatRequestCommand is sent every second?

2016-07-15 10:20:18.341 +02:00 [Debug] EventStream: EventStoreConnection '"ES-f224ae72-84d5-42d4-9e3a-4b6957058545"': "enqueueing message EventStore.ClientAPI.Internal.HandleTcpPackageMessage.".
2016-07-15 10:20:18.341 +02:00 [Debug] EventStream: EventStoreConnection '"ES-f224ae72-84d5-42d4-9e3a-4b6957058545"': "HandleTcpPackage connId 96116092-1e6c-4554-9aac-97ba4182c855, package HeartbeatRequestCommand, 16be95e0-df06-4989-8a90-f5085a9792cd.".
2016-07-15 10:20:19.747 +02:00 [Debug] EventStream: EventStoreConnection '"ES-f224ae72-84d5-42d4-9e3a-4b6957058545"': "enqueueing message EventStore.ClientAPI.Internal.HandleTcpPackageMessage.".
2016-07-15 10:20:19.747 +02:00 [Debug] EventStream: EventStoreConnection '"ES-f224ae72-84d5-42d4-9e3a-4b6957058545"': "HandleTcpPackage connId 96116092-1e6c-4554-9aac-97ba4182c855, package HeartbeatRequestCommand, 6f264d05-d799-4da7-94bb-81bbfe007752.".
2016-07-15 10:20:21.153 +02:00 [Debug] EventStream: EventStoreConnection '"ES-f224ae72-84d5-42d4-9e3a-4b6957058545"': "enqueueing message EventStore.ClientAPI.Internal.HandleTcpPackageMessage.".
2016-07-15 10:20:21.153 +02:00 [Debug] EventStream: EventStoreConnection '"ES-f224ae72-84d5-42d4-9e3a-4b6957058545"': "HandleTcpPackage connId 96116092-1e6c-4554-9aac-97ba4182c855, package HeartbeatRequestCommand, 6825bc8b-d761-453b-830d-556eb52b4bac.".
2016-07-15 10:20:22.560 +02:00 [Debug] EventStream: EventStoreConnection '"ES-f224ae72-84d5-42d4-9e3a-4b6957058545"': "enqueueing message EventStore.ClientAPI.Internal.HandleTcpPackageMessage.".
2016-07-15 10:20:22.560 +02:00 [Debug] EventStream: EventStoreConnection '"ES-f224ae72-84d5-42d4-9e3a-4b6957058545"': "HandleTcpPackage connId 96116092-1e6c-4554-9aac-97ba4182c855, package HeartbeatRequestCommand, 4b0670fb-32c0-40df-abb9-23f086f69a97.".

I also got my first disconnect after 4 sec.

Am I misunderstanding or is the heartbeat settings not honored?

Client is v3.3.1, server is 3.5, but these versions work fine on other machines.

/Peter

what are your server settings?

ExtTcpHeartbeatTimeout: 10000
ExtTcpHeartbeatInterval: 20000

and you are getting heartbeats once per second? I cannot reproduce this here.

Yep, at least according to the verbose client log below. (got limited access to the server, or I’d run Wireshark and have a look).

Are the client and server heartbeats completely separate, or do they interact somehow? It’s a bit odd that interval > timeout on server, but < on client (I based it on the defaults).

they are completely separate.

you can test this trivially without using a client.
bin/clusternode/EventStore.ClusterNode.exe
--ext-tcp-heartbeat-interval 5000

telnet 127.0.0.1 1113

That was interesting.

On my machine this works as expected. The connection is closed according to heartbeattimeout.

On the problematic server it’s closed in under a second. Even though it’s configured for 10 secs:

  EXT TCP HEARTBEAT TIMEOUT: 10000 (Config File)
  EXT TCP HEARTBEAT INTERVAL: 20000 (Config File)

What could possibly close the socket?

You will see a heartbeat message in telnet before it closes due to heartbeat.

Yep, I get the message, but it closes immediately after. On my machine it waits for the timeout.

Tried disabling antivirus, firewall and offloading to NIC. But keeps dropping connection/timing out.

Assuming it’s some windows crap, but it seems lika it’s only Event Store thats affected, our own services keep their connections. And everything works, it just reconnects every 1-10 minutes.

/Peter

does the server log say that ES is closing the connection?

Yes:

[PID:03928:009 2016.07.15 07:34:28.108 TRACE TcpConnectionManager] Closing
connection 'external-normal' [127.0.0.1:56451, L127.0.0.1:1113,
{a8ccb238-239b-454d-87bf-44550030b630}] cleanly. Reason: HEARTBEAT TIMEOUT
at msgNum 50
[PID:03928:009 2016.07.15 07:34:28.108 INFO TcpConnection ] ES
TcpConnection closed [07:34:28.108: N127.0.0.1:56451, L127.0.0.1:1113,
{a8ccb238-239b-454d-87bf-44550030b630}]:Received bytes: 1655, Sent bytes:
170281
[PID:03928:009 2016.07.15 07:34:28.108 INFO TcpConnection ] ES
TcpConnection closed [07:34:28.108: N127.0.0.1:56451, L127.0.0.1:1113,
{a8ccb238-239b-454d-87bf-44550030b630}]:Send calls: 51, callbacks: 51
[PID:03928:009 2016.07.15 07:34:28.108 INFO TcpConnection ] ES
TcpConnection closed [07:34:28.108: N127.0.0.1:56451, L127.0.0.1:1113,
{a8ccb238-239b-454d-87bf-44550030b630}]:Receive calls: 51, callbacks: 50
[PID:03928:009 2016.07.15 07:34:28.108 INFO TcpConnection ] ES
TcpConnection closed [07:34:28.108: N127.0.0.1:56451, L127.0.0.1:1113,
{a8ccb238-239b-454d-87bf-44550030b630}]:Close reason: [Success] HEARTBEAT
TIMEOUT at msgNum 50
[PID:03928:009 2016.07.15 07:34:28.108 INFO TcpConnectionManager]
Connection 'external-normal' [127.0.0.1:56451,
{a8ccb238-239b-454d-87bf-44550030b630}] closed: Success.

But I’m assuming thats because it never got the heartbeat, since the socket was broken, at least what the telnet test indicated.

Yes it will close the socket if it doesnt receive anything.

But you are saying if you set the timeout to 10 seconds es will close it in 1?

Yes. At least over telnet.

But the system works, there are packets being sent, so it can’t be the same behaviour for clients. Sometimes they’re disconnected after seconds, sometimes up to 10 mins.

Don’t know if ‘regular’ packets (non-heartbeat) kan keep the socket open?

yes they do. heartbeats are only sent if no other traffic is seen
during the interval

re: that the server closes too early, as I said I cannot reproduce
this. It closes as it should in my tests.