Recovering from heartbeat timeouts

Sorry to sound like a broken record, but I just can’t get this working properly. I spent last night testing locally, and I can recover from pretty much any scenario (stopping and restarting EventStore while running, etc.) — except heartbeat timeouts, which I can’t recreate reliably, even when setting them to a few milliseconds.

I reconnect, then after a little while I get another timeout. Until I restart my service/process, then it’s fine again. I can’t for the life of me understand what kind of state I have that causes this…

My subscription handlers just put things on an internal queue, so there’s nothing blocking there either. Today it ran for 3-4 hours before acting up, and the load at the beginning of the day was much higher, so I don’t think it’s related to that either.

The current plan is to just restart the service if I get a disconnect. Brute force recovery.

Environment:

The server has 8 cores and 16 GB RAM. EventStore has a dedicated 10,000 RPM 1 TB RAID array, with no other databases on it. (There are image files on the same disk, but they’re rarely written to; at most 2-3 a minute.)

Log:

[PID:05500:013 2014.07.19 12:12:32.105 TRACE TcpConnectionManager] Closing connection 'external-normal' [127.0.0.1:59696, L127.0.0.1:1113, {854d442c-5fd9-42d6-9601-821dd6f1ad45}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 1179513

[PID:05500:013 2014.07.19 12:12:32.138 INFO TcpConnection ] [12:12:32.134: N127.0.0.1:59696, L127.0.0.1:1113, {854d442c-5fd9-42d6-9601-821dd6f1ad45}]:

Received bytes: 127253113, Sent bytes: 36216000645

Send calls: 1276943, callbacks: 1276943

Receive calls: 1179929, callbacks: 1179928

Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 1179513

[PID:05500:013 2014.07.19 12:12:32.157 INFO TcpConnectionManager] Connection 'external-normal' [127.0.0.1:59696, {854d442c-5fd9-42d6-9601-821dd6f1ad45}] closed: Success.

[PID:05500:003 2014.07.19 12:12:49.124 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61567, L127.0.0.1:1113, {948f147c-b5cd-460d-937a-2f3342f9818a}].

[PID:05500:003 2014.07.19 12:12:49.593 INFO TcpConnection ] [12:12:49.593: N127.0.0.1:61567, L127.0.0.1:1113, {948f147c-b5cd-460d-937a-2f3342f9818a}]:

Received bytes: 242, Sent bytes: 37307

Send calls: 4, callbacks: 4

Receive calls: 5, callbacks: 5

Close reason: [Success] Socket closed

[PID:05500:003 2014.07.19 12:12:49.593 INFO TcpConnectionManager] Connection 'external-normal' [127.0.0.1:61567, {948f147c-b5cd-460d-937a-2f3342f9818a}] closed: Success.

[PID:05500:003 2014.07.19 12:13:19.400 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61574, L127.0.0.1:1113, {3e8018ce-ca4b-41e7-9bc8-7a0ad003a5bf}].

[PID:05500:013 2014.07.19 12:17:25.445 TRACE TcpConnectionManager] Closing connection 'external-normal' [127.0.0.1:61574, L127.0.0.1:1113, {3e8018ce-ca4b-41e7-9bc8-7a0ad003a5bf}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 272

[PID:05500:013 2014.07.19 12:17:25.445 INFO TcpConnection ] [12:17:25.445: N127.0.0.1:61574, L127.0.0.1:1113, {3e8018ce-ca4b-41e7-9bc8-7a0ad003a5bf}]:

Received bytes: 53707, Sent bytes: 3357323

Send calls: 403, callbacks: 403

Receive calls: 273, callbacks: 272

Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 272

[PID:05500:013 2014.07.19 12:17:25.445 INFO TcpConnectionManager] Connection 'external-normal' [127.0.0.1:61574, {3e8018ce-ca4b-41e7-9bc8-7a0ad003a5bf}] closed: Success.

[PID:05500:022 2014.07.19 12:18:11.437 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61613, L127.0.0.1:1113, {c82c591a-3394-4b61-900e-3cfb5ae47e57}].

[PID:05500:030 2014.07.19 12:18:36.883 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #1]: ReadAllEventsForward - 500ms. Q: 0/0.

[PID:05500:012 2014.07.19 12:18:41.564 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WriteTransactionData - 720ms. Q: 0/2.

[PID:05500:013 2014.07.19 12:18:57.442 TRACE TcpConnectionManager] Closing connection 'external-normal' [127.0.0.1:61613, L127.0.0.1:1113, {c82c591a-3394-4b61-900e-3cfb5ae47e57}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 1442

[PID:05500:013 2014.07.19 12:18:57.442 INFO TcpConnection ] [12:18:57.442: N127.0.0.1:61613, L127.0.0.1:1113, {c82c591a-3394-4b61-900e-3cfb5ae47e57}]:

Received bytes: 109227, Sent bytes: 45337342

Send calls: 1477, callbacks: 1477

Receive calls: 1443, callbacks: 1442

Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 1442

[PID:05500:013 2014.07.19 12:18:57.442 INFO TcpConnectionManager] Connection 'external-normal' [127.0.0.1:61613, {c82c591a-3394-4b61-900e-3cfb5ae47e57}] closed: Success.

[PID:05500:013 2014.07.19 12:19:08.317 TRACE InMemoryBus ] SLOW BUS MSG [Core Output]: ReadAllEventsForward - 404ms. Handler: RequestResponseQueueForwarder.

[PID:05500:012 2014.07.19 12:19:42.472 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 653ms. Q: 0/1.

[PID:05500:022 2014.07.19 12:20:02.375 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61658, L127.0.0.1:1113, {2a2f1909-42e3-4e13-8995-d0609afa436d}].

[PID:05500:013 2014.07.19 12:20:10.376 TRACE TcpConnectionManager] Closing connection 'external-normal' [127.0.0.1:61658, L127.0.0.1:1113, {2a2f1909-42e3-4e13-8995-d0609afa436d}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 0

[PID:05500:013 2014.07.19 12:20:10.376 INFO TcpConnection ] [12:20:10.376: N127.0.0.1:61658, L127.0.0.1:1113, {2a2f1909-42e3-4e13-8995-d0609afa436d}]:

Received bytes: 0, Sent bytes: 22

Send calls: 1, callbacks: 1

Receive calls: 1, callbacks: 0

Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 0

[PID:05500:013 2014.07.19 12:20:10.376 INFO TcpConnectionManager] Connection 'external-normal' [127.0.0.1:61658, {2a2f1909-42e3-4e13-8995-d0609afa436d}] closed: Success.

[PID:05500:012 2014.07.19 12:20:14.614 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 3068ms. Q: 0/0.

[PID:05500:014 2014.07.19 12:20:14.692 TRACE InMemoryBus ] SLOW BUS MSG [MainBus]: RequestManagerTimerTick - 280ms. Handler: RequestManagementService.

[PID:05500:014 2014.07.19 12:20:14.692 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [MainQueue]: RequestManagerTimerTick - 348ms. Q: 0/5.

[PID:05500:031 2014.07.19 12:20:48.863 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61671, L127.0.0.1:1113, {58c2b74d-0429-4945-bb07-83f25b192409}].

[PID:05500:013 2014.07.19 12:20:56.864 TRACE TcpConnectionManager] Closing connection 'external-normal' [127.0.0.1:61671, L127.0.0.1:1113, {58c2b74d-0429-4945-bb07-83f25b192409}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 0

[PID:05500:013 2014.07.19 12:20:56.864 INFO TcpConnection ] [12:20:56.864: N127.0.0.1:61671, L127.0.0.1:1113, {58c2b74d-0429-4945-bb07-83f25b192409}]:

Received bytes: 0, Sent bytes: 22

Send calls: 1, callbacks: 1

Receive calls: 1, callbacks: 0

Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 0

[PID:05500:013 2014.07.19 12:20:56.864 INFO TcpConnectionManager] Connection 'external-normal' [127.0.0.1:61671, {58c2b74d-0429-4945-bb07-83f25b192409}] closed: Success.

[PID:05500:033 2014.07.19 12:21:24.090 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61674, L127.0.0.1:1113, {4340b9de-199d-4407-8a82-4f630821be7b}].

[PID:05500:033 2014.07.19 12:21:27.140 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61675, L127.0.0.1:1113, {d30d50b9-2c5a-4419-8f97-ec8e5ec26db3}].

[PID:05500:013 2014.07.19 12:21:32.092 TRACE TcpConnectionManager] Closing connection 'external-normal' [127.0.0.1:61674, L127.0.0.1:1113, {4340b9de-199d-4407-8a82-4f630821be7b}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 0

[PID:05500:013 2014.07.19 12:21:32.092 INFO TcpConnection ] [12:21:32.092: N127.0.0.1:61674, L127.0.0.1:1113, {4340b9de-199d-4407-8a82-4f630821be7b}]:

Received bytes: 0, Sent bytes: 22

Send calls: 1, callbacks: 1

Receive calls: 1, callbacks: 0

Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 0

[PID:05500:013 2014.07.19 12:21:32.092 INFO TcpConnectionManager] Connection 'external-normal' [127.0.0.1:61674, {4340b9de-199d-4407-8a82-4f630821be7b}] closed: Success.

[PID:05500:013 2014.07.19 12:21:35.142 TRACE TcpConnectionManager] Closing connection 'external-normal' [127.0.0.1:61675, L127.0.0.1:1113, {d30d50b9-2c5a-4419-8f97-ec8e5ec26db3}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 0

[PID:05500:013 2014.07.19 12:21:35.142 INFO TcpConnection ] [12:21:35.142: N127.0.0.1:61675, L127.0.0.1:1113, {d30d50b9-2c5a-4419-8f97-ec8e5ec26db3}]:

Received bytes: 0, Sent bytes: 22

Send calls: 1, callbacks: 1

Receive calls: 1, callbacks: 0

Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 0

[PID:05500:013 2014.07.19 12:21:35.142 INFO TcpConnectionManager] Connection 'external-normal' [127.0.0.1:61675, {d30d50b9-2c5a-4419-8f97-ec8e5ec26db3}] closed: Success.

[PID:05500:012 2014.07.19 12:21:42.698 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 1062ms. Q: 0/0.

[PID:05500:016 2014.07.19 12:21:54.339 TRACE InMemoryBus ] SLOW BUS MSG [manager input bus]: RegularTimeout - 146ms. Handler: ProjectionManager.

[PID:05500:016 2014.07.19 12:21:54.339 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projections Master]: RegularTimeout - 148ms. Q: 0/1.

[PID:05500:015 2014.07.19 12:21:54.339 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 198ms. Q: 0/0.

[PID:05500:023 2014.07.19 12:23:34.975 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61745, L127.0.0.1:1113, {67f76943-8b67-4f3f-8081-407e3a857d72}].

[PID:05500:013 2014.07.19 12:23:37.183 TRACE InMemoryBus ] SLOW BUS MSG [Core Output]: ReadAllEventsForward - 56ms. Handler: RequestResponseQueueForwarder.

[PID:05500:016 2014.07.19 12:23:37.183 TRACE InMemoryBus ] SLOW BUS MSG [manager input bus]: RegularTimeout - 105ms. Handler: ProjectionManager.

[PID:05500:016 2014.07.19 12:23:37.183 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projections Master]: RegularTimeout - 109ms. Q: 0/1.

[PID:05500:015 2014.07.19 12:23:37.183 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 153ms. Q: 0/0.

[PID:05500:012 2014.07.19 12:23:43.266 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 1568ms. Q: 0/0.

[PID:05500:013 2014.07.19 12:23:45.976 TRACE TcpConnectionManager] Closing connection 'external-normal' [127.0.0.1:61745, L127.0.0.1:1113, {67f76943-8b67-4f3f-8081-407e3a857d72}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 1

[PID:05500:013 2014.07.19 12:23:45.976 INFO TcpConnection ] [12:23:45.976: N127.0.0.1:61745, L127.0.0.1:1113, {67f76943-8b67-4f3f-8081-407e3a857d72}]:

Received bytes: 37, Sent bytes: 44

Send calls: 2, callbacks: 2

Receive calls: 2, callbacks: 1

Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 1

[PID:05500:013 2014.07.19 12:23:45.983 INFO TcpConnectionManager] Connection 'external-normal' [127.0.0.1:61745, {67f76943-8b67-4f3f-8081-407e3a857d72}] closed: Success.

[PID:05500:012 2014.07.19 12:24:43.566 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WriteCommit - 1320ms. Q: 0/0.

[PID:05500:031 2014.07.19 12:25:04.517 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61764, L127.0.0.1:1113, {615ce44f-5db5-4c30-b334-93fcedc1e1aa}].

I isolated all connection handling into its own class (it became rather more complicated than I wanted it to, but I wanted to be 100% sure I didn’t touch the connection from more than one thread at a time):

public class ConnectionHandler : IDisposable
{
    private readonly ConnectionSettingsBuilder _settings;
    private readonly Core.Log.ILogger _logger;
    private readonly IPEndPoint _ipEndPoint;
    private readonly ManualResetEvent _gate = new ManualResetEvent(false);
    private IEventStoreConnection _connection = null;
    private readonly Subject<ConnectionStatusMessage> _status = new Subject<ConnectionStatusMessage>();
    private bool _closing = false;
    private int _attempts = 0;

    public ConnectionHandler(IPEndPoint ipEndPoint, Core.Log.ILogger logger)
    {
        _ipEndPoint = ipEndPoint;
        _logger = logger;

        _settings = ConnectionSettings.Create()
            .UseCustomLogger(new EventStoreLog4netLogger())
            .DisableVerboseLogging()
            .FailOnNoServerResponse()
            .SetHeartbeatInterval(TimeSpan.FromSeconds(10))
            .SetHeartbeatTimeout(TimeSpan.FromMinutes(2))
            .KeepRetrying()
            .LimitRetriesForOperationTo(10) // note: overrides KeepRetrying() above
            .SetOperationTimeoutTo(TimeSpan.FromSeconds(30))
            .OnClosed((c, _) => _status.OnNext(new ConnectionStatusMessage(false, c.ConnectionName)))
            .OnConnected((c, _) => _status.OnNext(new ConnectionStatusMessage(true, c.ConnectionName)))
            .OnDisconnected((c, _) => _status.OnNext(new ConnectionStatusMessage(false, c.ConnectionName)))
            .OnErrorOccurred((c, e) => _logger.Warn("EventStore error: {0}", e.Message))
            .OnReconnecting(c => _logger.Debug("Reconnecting..."))
            .SetDefaultUserCredentials(new UserCredentials("admin", "changeit"));

        _status
            .ObserveOn(NewThreadScheduler.Default)
            .DistinctUntilChanged(m => m.GetHashCode())
            .Subscribe(status =>
            {
                if (_closing)
                    return;

                // Not relevant for the current connection
                if (_connection != null && _connection.ConnectionName != status.ConnectionName)
                    return;

                if (status.IsConnected)
                {
                    _gate.Set();
                    _logger.Info("EventStore {0} connected", status.ConnectionName);
                    _attempts = 0;
                }
                else
                {
                    _gate.Reset();
                    _attempts++;
                    _logger.Warn("EventStore {0} disconnected/closed, recreating connection, attempt {1}", status.ConnectionName, _attempts);
                    try
                    {
                        if (_connection != null)
                        {
                            using (_connection)
                            {
                                _connection.Close();
                            }
                        }
                    }
                    catch
                    {
                    }
                    // Note: Math.Max means this always waits at least 30 seconds;
                    // Math.Min would cap the backoff at 30 seconds instead.
                    Task.Delay(TimeSpan.FromSeconds(Math.Max(_attempts, 30))).Wait();
                    Create();
                }
            }, () =>
            {
                if (_connection != null)
                {
                    using (_connection)
                    {
                        _connection.Close();
                    }
                }
            });

        Create();
    }

    private void Create()
    {
        _connection = EventStoreConnection.Create(_settings, _ipEndPoint);
        _logger.Info("EventStore {0} created", _connection.ConnectionName);
        _connection.Connect();
    }

    public IEventStoreConnection Connection(TimeSpan timeout)
    {
        if (!_gate.WaitOne(timeout))
            throw new InvalidOperationException("Can not connect to EventStore");
        return _connection;
    }

    public void TriggerReconnect(string prevConnectionName)
    {
        if (_closing)
            return;
        _status.OnNext(new ConnectionStatusMessage(false, prevConnectionName));
    }

    public void Dispose()
    {
        _closing = true;
        _gate.Set();
        _status.OnCompleted();
    }

    private struct ConnectionStatusMessage
    {
        public readonly bool IsConnected;
        public readonly string ConnectionName;

        public ConnectionStatusMessage(bool isConnected, string connectionName)
        {
            IsConnected = isConnected;
            ConnectionName = connectionName;
        }

        public override int GetHashCode()
        {
            return (ConnectionName + IsConnected.ToString()).GetHashCode();
        }

        public override bool Equals(object obj)
        {
            if (!(obj is ConnectionStatusMessage))
                return false;
            var that = (ConnectionStatusMessage)obj;
            return this.ConnectionName == that.ConnectionName && this.IsConnected == that.IsConnected;
        }
    }
}

You are getting server-originated heartbeat timeouts (the ones you are setting are the client ones…). The server will occasionally heartbeat the client to see if the socket is still alive. The connection should reconnect by itself if it is lost, however. Are you saying that your connection isn’t reconnecting when it’s lost?

Well, I recreate the connection, but get another timeout within seconds.

I had issues earlier with reusing the same connection, so now I dispose it and create a new one.

There is logic in the connection itself to handle retries, and we have not heard of anyone else having issues with that logic. Can you send client (verbose) and server logs from when this happens so we can take a look at it? What is the network setup between the client and the server?

So even if I get a disconnect event, the connection will keep retrying? I must have misunderstood that early on; I thought the disconnect came after a number of retries. So the solution might be as simple as letting the connection manage this by itself. I guess I ran into issues on an early version and drew the wrong conclusions…

I’m travelling at the moment, but will get a log by the end of the week if the problem persists.

There is actually no network in between; client and server are on the same machine…

Yes, the connection manages reconnections etc. by itself (in the builder there are bunches of options for this, like KeepReconnecting()).
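For reference, a minimal sketch of relying on that built-in behaviour instead of tearing the connection down by hand. This assumes the 2014-era EventStore .NET ClientAPI; the builder methods shown (KeepReconnecting, SetReconnectionDelayTo, the On* callbacks) are from that API, and the endpoint/port are just the defaults used elsewhere in this thread:

```csharp
using System;
using System.Net;
using EventStore.ClientAPI;

// Sketch: let the client's own reconnection logic do the work.
var settings = ConnectionSettings.Create()
    .KeepReconnecting()                                 // never give up reconnecting
    .KeepRetrying()                                     // keep retrying queued operations
    .SetReconnectionDelayTo(TimeSpan.FromSeconds(5))    // pause between reconnect attempts
    .SetHeartbeatInterval(TimeSpan.FromSeconds(10))
    .SetHeartbeatTimeout(TimeSpan.FromSeconds(30))
    // Log state changes only; Disconnected is followed by the client's own
    // reconnect attempts, so no manual dispose/recreate is needed.
    .OnDisconnected((c, _) => Console.WriteLine("Disconnected, client will retry"))
    .OnReconnecting(c => Console.WriteLine("Reconnecting..."))
    .OnConnected((c, _) => Console.WriteLine("Connected"));

var connection = EventStoreConnection.Create(
    settings, new IPEndPoint(IPAddress.Loopback, 1113));
connection.Connect();
```

With KeepReconnecting() set, a disconnect event is informational rather than something the application has to recover from itself.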

Hi Greg,

Just to add another data point we have seen similar things to what Peter is reporting. When I see it next I’ll try and get some logs for you as well.

cheers

Andrew Browne

Our company, too, has a service that runs for long periods of time, then gets a server-originated heartbeat timeout disconnect, causing the service to blow up in spectacular fashion. After (mere) hours we gave up trying to figure out the reconnection junk, which didn’t seem to work reliably, and just allowed the service to tear down and automatically restart. I’m sure this issue is ours and not GES’s, but I just wanted to add my data point as well.

What was your reconnection setup on the connection? Perhaps this is an issue, or perhaps it’s a documentation/example issue, but I would like to get to the bottom of it.

var connectionSettingsBuilder = EventStore.ClientAPI.ConnectionSettings.Create();

connectionSettingsBuilder.UseCustomLogger(new EventStoreLogger(logService));

connectionSettingsBuilder.SetDefaultUserCredentials(new EventStore.ClientAPI.SystemData.UserCredentials("[redacted]", "[redacted]"));

connectionSettingsBuilder.KeepReconnecting();

connectionSettingsBuilder.KeepRetrying();

var eventStoreConnection = EventStore.ClientAPI.EventStoreConnection.Create(connectionSettingsBuilder, new IPEndPoint(config.EventStoreIPAddress, config.EventStorePort));

The problem is very likely downstream from GES in our case. The reason we gave up is the ‘unreproducibility’ of this specific issue, short of pulling network cables. The brute-force restart, however inelegant, works in our case.

Even with a network outage you should get a reconnect from the client. To reproduce it, I might try turning verbose logging on (it will tell you exactly what’s happening in the client).
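Verbose client logging is switched on in the same settings builder. A minimal sketch, assuming the 2014-era .NET ClientAPI (EnableVerboseLogging and UseConsoleLogger are builder methods from that API):

```csharp
using EventStore.ClientAPI;

// Sketch: turn on verbose client-side logging so connection internals
// (heartbeats, reconnect attempts, packages sent/received) show up in the log.
var settings = ConnectionSettings.Create()
    .UseConsoleLogger()        // or UseCustomLogger(...) to route into log4net etc.
    .EnableVerboseLogging();
```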

This is more than a little embarrassing, but I guess it’s only fair to admit your mistakes and not blame anyone else:

It seems the heartbeat timeouts were caused by a buggy, unbounded read-store query consuming pretty much all server memory. That’s why the only way to stabilize the connection again was to restart our service.

(In my defence, I always heard about this after the fact, and the only thing caught in the logs was the EventStore problems…)

Sorry for wasting people’s time, but at the same time I’m glad I managed to sort out the connection behaviour; the code is radically simplified now…

Cheers, Peter