Sorry to sound like a broken record, but I just can’t get this working properly. Spent last night testing locally, I can recover from pretty much any scenario (stopping and restarting eventstore while running etc). But not from heartbeat timeouts (that I can’t recreate reliably, even when setting them to millisecond).
I reconnect, then after a little while I get another timeout. Until I restart my service/process, then it’s fine again. I can’t for the life of me understand what kind of state I have that causes this…
My subscription handlers just puts things on an internal queue, so there’s nothing blocking there either. Today it ran for 3-4 hours before acting up, the load at the beginning of the day was much higher, so I don’t think it’s related to that either.
The current plan is to just restart the service if I get a disconnect. Brute force recovery.
Environment:
Server has 8 cores and 16gb ram. Eventstore has a dedicated 10.000 rpm 1tb raid, with no other databases on it. (There are imagefile on the same disk but they’re rarely written. max 2-3 a minute)
Log:
[PID:05500:013 2014.07.19 12:12:32.105 TRACE TcpConnectionManager] Closing connection ‘external-normal’ [127.0.0.1:59696, L127.0.0.1:1113, {854d442c-5fd9-42d6-9601-821dd6f1ad45}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 1179513
[PID:05500:013 2014.07.19 12:12:32.138 INFO TcpConnection ] [12:12:32.134: N127.0.0.1:59696, L127.0.0.1:1113, {854d442c-5fd9-42d6-9601-821dd6f1ad45}]:
Received bytes: 127253113, Sent bytes: 36216000645
Send calls: 1276943, callbacks: 1276943
Receive calls: 1179929, callbacks: 1179928
Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 1179513
[PID:05500:013 2014.07.19 12:12:32.157 INFO TcpConnectionManager] Connection ‘external-normal’ [127.0.0.1:59696, {854d442c-5fd9-42d6-9601-821dd6f1ad45}] closed: Success.
[PID:05500:003 2014.07.19 12:12:49.124 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61567, L127.0.0.1:1113, {948f147c-b5cd-460d-937a-2f3342f9818a}].
[PID:05500:003 2014.07.19 12:12:49.593 INFO TcpConnection ] [12:12:49.593: N127.0.0.1:61567, L127.0.0.1:1113, {948f147c-b5cd-460d-937a-2f3342f9818a}]:
Received bytes: 242, Sent bytes: 37307
Send calls: 4, callbacks: 4
Receive calls: 5, callbacks: 5
Close reason: [Success] Socket closed
[PID:05500:003 2014.07.19 12:12:49.593 INFO TcpConnectionManager] Connection ‘external-normal’ [127.0.0.1:61567, {948f147c-b5cd-460d-937a-2f3342f9818a}] closed: Success.
[PID:05500:003 2014.07.19 12:13:19.400 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61574, L127.0.0.1:1113, {3e8018ce-ca4b-41e7-9bc8-7a0ad003a5bf}].
[PID:05500:013 2014.07.19 12:17:25.445 TRACE TcpConnectionManager] Closing connection ‘external-normal’ [127.0.0.1:61574, L127.0.0.1:1113, {3e8018ce-ca4b-41e7-9bc8-7a0ad003a5bf}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 272
[PID:05500:013 2014.07.19 12:17:25.445 INFO TcpConnection ] [12:17:25.445: N127.0.0.1:61574, L127.0.0.1:1113, {3e8018ce-ca4b-41e7-9bc8-7a0ad003a5bf}]:
Received bytes: 53707, Sent bytes: 3357323
Send calls: 403, callbacks: 403
Receive calls: 273, callbacks: 272
Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 272
[PID:05500:013 2014.07.19 12:17:25.445 INFO TcpConnectionManager] Connection ‘external-normal’ [127.0.0.1:61574, {3e8018ce-ca4b-41e7-9bc8-7a0ad003a5bf}] closed: Success.
[PID:05500:022 2014.07.19 12:18:11.437 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61613, L127.0.0.1:1113, {c82c591a-3394-4b61-900e-3cfb5ae47e57}].
[PID:05500:030 2014.07.19 12:18:36.883 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #1]: ReadAllEventsForward - 500ms. Q: 0/0.
[PID:05500:012 2014.07.19 12:18:41.564 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WriteTransactionData - 720ms. Q: 0/2.
[PID:05500:013 2014.07.19 12:18:57.442 TRACE TcpConnectionManager] Closing connection ‘external-normal’ [127.0.0.1:61613, L127.0.0.1:1113, {c82c591a-3394-4b61-900e-3cfb5ae47e57}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 1442
[PID:05500:013 2014.07.19 12:18:57.442 INFO TcpConnection ] [12:18:57.442: N127.0.0.1:61613, L127.0.0.1:1113, {c82c591a-3394-4b61-900e-3cfb5ae47e57}]:
Received bytes: 109227, Sent bytes: 45337342
Send calls: 1477, callbacks: 1477
Receive calls: 1443, callbacks: 1442
Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 1442
[PID:05500:013 2014.07.19 12:18:57.442 INFO TcpConnectionManager] Connection ‘external-normal’ [127.0.0.1:61613, {c82c591a-3394-4b61-900e-3cfb5ae47e57}] closed: Success.
[PID:05500:013 2014.07.19 12:19:08.317 TRACE InMemoryBus ] SLOW BUS MSG [Core Output]: ReadAllEventsForward - 404ms. Handler: RequestResponseQueueForwarder.
[PID:05500:012 2014.07.19 12:19:42.472 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 653ms. Q: 0/1.
[PID:05500:022 2014.07.19 12:20:02.375 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61658, L127.0.0.1:1113, {2a2f1909-42e3-4e13-8995-d0609afa436d}].
[PID:05500:013 2014.07.19 12:20:10.376 TRACE TcpConnectionManager] Closing connection ‘external-normal’ [127.0.0.1:61658, L127.0.0.1:1113, {2a2f1909-42e3-4e13-8995-d0609afa436d}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 0
[PID:05500:013 2014.07.19 12:20:10.376 INFO TcpConnection ] [12:20:10.376: N127.0.0.1:61658, L127.0.0.1:1113, {2a2f1909-42e3-4e13-8995-d0609afa436d}]:
Received bytes: 0, Sent bytes: 22
Send calls: 1, callbacks: 1
Receive calls: 1, callbacks: 0
Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 0
[PID:05500:013 2014.07.19 12:20:10.376 INFO TcpConnectionManager] Connection ‘external-normal’ [127.0.0.1:61658, {2a2f1909-42e3-4e13-8995-d0609afa436d}] closed: Success.
[PID:05500:012 2014.07.19 12:20:14.614 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 3068ms. Q: 0/0.
[PID:05500:014 2014.07.19 12:20:14.692 TRACE InMemoryBus ] SLOW BUS MSG [MainBus]: RequestManagerTimerTick - 280ms. Handler: RequestManagementService.
[PID:05500:014 2014.07.19 12:20:14.692 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [MainQueue]: RequestManagerTimerTick - 348ms. Q: 0/5.
[PID:05500:031 2014.07.19 12:20:48.863 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61671, L127.0.0.1:1113, {58c2b74d-0429-4945-bb07-83f25b192409}].
[PID:05500:013 2014.07.19 12:20:56.864 TRACE TcpConnectionManager] Closing connection ‘external-normal’ [127.0.0.1:61671, L127.0.0.1:1113, {58c2b74d-0429-4945-bb07-83f25b192409}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 0
[PID:05500:013 2014.07.19 12:20:56.864 INFO TcpConnection ] [12:20:56.864: N127.0.0.1:61671, L127.0.0.1:1113, {58c2b74d-0429-4945-bb07-83f25b192409}]:
Received bytes: 0, Sent bytes: 22
Send calls: 1, callbacks: 1
Receive calls: 1, callbacks: 0
Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 0
[PID:05500:013 2014.07.19 12:20:56.864 INFO TcpConnectionManager] Connection ‘external-normal’ [127.0.0.1:61671, {58c2b74d-0429-4945-bb07-83f25b192409}] closed: Success.
[PID:05500:033 2014.07.19 12:21:24.090 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61674, L127.0.0.1:1113, {4340b9de-199d-4407-8a82-4f630821be7b}].
[PID:05500:033 2014.07.19 12:21:27.140 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61675, L127.0.0.1:1113, {d30d50b9-2c5a-4419-8f97-ec8e5ec26db3}].
[PID:05500:013 2014.07.19 12:21:32.092 TRACE TcpConnectionManager] Closing connection ‘external-normal’ [127.0.0.1:61674, L127.0.0.1:1113, {4340b9de-199d-4407-8a82-4f630821be7b}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 0
[PID:05500:013 2014.07.19 12:21:32.092 INFO TcpConnection ] [12:21:32.092: N127.0.0.1:61674, L127.0.0.1:1113, {4340b9de-199d-4407-8a82-4f630821be7b}]:
Received bytes: 0, Sent bytes: 22
Send calls: 1, callbacks: 1
Receive calls: 1, callbacks: 0
Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 0
[PID:05500:013 2014.07.19 12:21:32.092 INFO TcpConnectionManager] Connection ‘external-normal’ [127.0.0.1:61674, {4340b9de-199d-4407-8a82-4f630821be7b}] closed: Success.
[PID:05500:013 2014.07.19 12:21:35.142 TRACE TcpConnectionManager] Closing connection ‘external-normal’ [127.0.0.1:61675, L127.0.0.1:1113, {d30d50b9-2c5a-4419-8f97-ec8e5ec26db3}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 0
[PID:05500:013 2014.07.19 12:21:35.142 INFO TcpConnection ] [12:21:35.142: N127.0.0.1:61675, L127.0.0.1:1113, {d30d50b9-2c5a-4419-8f97-ec8e5ec26db3}]:
Received bytes: 0, Sent bytes: 22
Send calls: 1, callbacks: 1
Receive calls: 1, callbacks: 0
Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 0
[PID:05500:013 2014.07.19 12:21:35.142 INFO TcpConnectionManager] Connection ‘external-normal’ [127.0.0.1:61675, {d30d50b9-2c5a-4419-8f97-ec8e5ec26db3}] closed: Success.
[PID:05500:012 2014.07.19 12:21:42.698 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 1062ms. Q: 0/0.
[PID:05500:016 2014.07.19 12:21:54.339 TRACE InMemoryBus ] SLOW BUS MSG [manager input bus]: RegularTimeout - 146ms. Handler: ProjectionManager.
[PID:05500:016 2014.07.19 12:21:54.339 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projections Master]: RegularTimeout - 148ms. Q: 0/1.
[PID:05500:015 2014.07.19 12:21:54.339 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 198ms. Q: 0/0.
[PID:05500:023 2014.07.19 12:23:34.975 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61745, L127.0.0.1:1113, {67f76943-8b67-4f3f-8081-407e3a857d72}].
[PID:05500:013 2014.07.19 12:23:37.183 TRACE InMemoryBus ] SLOW BUS MSG [Core Output]: ReadAllEventsForward - 56ms. Handler: RequestResponseQueueForwarder.
[PID:05500:016 2014.07.19 12:23:37.183 TRACE InMemoryBus ] SLOW BUS MSG [manager input bus]: RegularTimeout - 105ms. Handler: ProjectionManager.
[PID:05500:016 2014.07.19 12:23:37.183 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [Projections Master]: RegularTimeout - 109ms. Q: 0/1.
[PID:05500:015 2014.07.19 12:23:37.183 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [MonitoringQueue]: GetFreshStats - 153ms. Q: 0/0.
[PID:05500:012 2014.07.19 12:23:43.266 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 1568ms. Q: 0/0.
[PID:05500:013 2014.07.19 12:23:45.976 TRACE TcpConnectionManager] Closing connection ‘external-normal’ [127.0.0.1:61745, L127.0.0.1:1113, {67f76943-8b67-4f3f-8081-407e3a857d72}] cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 1
[PID:05500:013 2014.07.19 12:23:45.976 INFO TcpConnection ] [12:23:45.976: N127.0.0.1:61745, L127.0.0.1:1113, {67f76943-8b67-4f3f-8081-407e3a857d72}]:
Received bytes: 37, Sent bytes: 44
Send calls: 2, callbacks: 2
Receive calls: 2, callbacks: 1
Close reason: [Success] HEARTBEAT TIMEOUT at msgNum 1
[PID:05500:013 2014.07.19 12:23:45.983 INFO TcpConnectionManager] Connection ‘external-normal’ [127.0.0.1:61745, {67f76943-8b67-4f3f-8081-407e3a857d72}] closed: Success.
[PID:05500:012 2014.07.19 12:24:43.566 TRACE QueuedHandlerMRES ] SLOW QUEUE MSG [StorageWriterQueue]: WriteCommit - 1320ms. Q: 0/0.
[PID:05500:031 2014.07.19 12:25:04.517 INFO TcpService ] External TCP connection accepted: [Normal, 127.0.0.1:61764, L127.0.0.1:1113, {615ce44f-5db5-4c30-b334-93fcedc1e1aa}].
I isolated all connectionhandling to it’s own class (that became rather more complicated than I wanted it to, but I wanted to be 100% sure I didn’t fiddle with the connection on more than 1 thread at the time):
public class ConnectionHandler : IDisposable
{
private readonly ConnectionSettingsBuilder _settings;
private readonly Core.Log.ILogger _logger;
private readonly IPEndPoint _ipEndPoint;
private readonly ManualResetEvent _gate = new ManualResetEvent(false);
private IEventStoreConnection _connection = null;
private readonly Subject _status = new Subject();
private bool _closing = false;
private int _attempts = 0;
public ConnectionHandler(IPEndPoint ipEndPoint, Core.Log.ILogger logger)
{
_ipEndPoint = ipEndPoint;
_logger = logger;
_settings = ConnectionSettings.Create()
.UseCustomLogger(new EventStoreLog4netLogger())
.DisableVerboseLogging()
.FailOnNoServerResponse()
.SetHeartbeatInterval(TimeSpan.FromSeconds(10))
.SetHeartbeatTimeout(TimeSpan.FromMinutes(2))
.KeepRetrying()
.LimitRetriesForOperationTo(10)
.SetOperationTimeoutTo(TimeSpan.FromSeconds(30))
.OnClosed((c, _) => _status.OnNext(new ConnectionStatusMessage(false, c.ConnectionName)))
.OnConnected((c, _) => _status.OnNext(new ConnectionStatusMessage(true, c.ConnectionName)))
.OnDisconnected((c, _) => _status.OnNext(new ConnectionStatusMessage(false, c.ConnectionName)))
.OnErrorOccurred((c, e) => _logger.Warn(“EventStore error: {0}”,e.Message))
.OnReconnecting(c => _logger.Debug(“Reconnecting…”))
.SetDefaultUserCredentials(new UserCredentials(“admin”, “changeit”));
_status
.ObserveOn(NewThreadScheduler.Default)
.DistinctUntilChanged(m => m.GetHashCode())
.Subscribe(status =>
{
if (_closing)
return;
//Not relevant for current connection
if (_connection != null && _connection.ConnectionName != status.ConnectionName)
return;
if (status.IsConnected)
{
_gate.Set();
_logger.Info(“EventStore {0} connected”, status.ConnectionName);
_attempts = 0;
}
else
{
_gate.Reset();
_attempts++;
_logger.Warn(“EventStore {0} disconnected/closed, recreating connection, attempt {1}”, status.ConnectionName, _attempts);
try
{
if (_connection != null)
{
using (_connection)
{
_connection.Close();
}
}
}
catch
{
}
Task.Delay(TimeSpan.FromSeconds(Math.Max(_attempts, 30))).Wait();
Create();
}
}, () =>
{
if (_connection != null)
{
using (_connection)
{
_connection.Close();
}
}
});
Create();
}
private void Create()
{
_connection = EventStoreConnection.Create(_settings, _ipEndPoint);
_logger.Info(“EventStore {0} created”, _connection.ConnectionName);
_connection.Connect();
}
public IEventStoreConnection Connection(TimeSpan timeout)
{
if (!_gate.WaitOne(timeout))
throw new InvalidOperationException(String.Format(“Can not connect to EventStore”));
return _connection;
}
public void TriggerReconnect(string prevConnectionName)
{
if (_closing)
return;
_status.OnNext(new ConnectionStatusMessage(false, prevConnectionName));
}
public void Dispose()
{
_closing = true;
_gate.Set();
_status.OnCompleted();
}
private struct ConnectionStatusMessage
{
public readonly bool IsConnected;
public readonly string ConnectionName;
public ConnectionStatusMessage(bool isConnected, string connectionName)
{
IsConnected = isConnected;
ConnectionName = connectionName;
}
public override int GetHashCode()
{
return (ConnectionName + IsConnected.ToString()).GetHashCode();
}
public override bool Equals(object obj)
{
if (!(obj is ConnectionStatusMessage))
return false;
var that = (ConnectionStatusMessage)obj;
return this.ConnectionName == that.ConnectionName && this.IsConnected == that.IsConnected;
}
}
}