Constant heartbeat timeouts after a handful of events are processed

Just looking for ideas of threads to pull on:

This seems to only be happening in our production environment now that we’re trying to deploy this new service. We have an existing service subscribed to EventStore that isn’t having any issues, but I haven’t tried to do a replay there, so maybe this problem just isn’t presenting yet for that service.

I processed about half a million events into EventStore without issue, but my denormalizer is choking due to what appears to be a heartbeat timeout. Here’s what I’ve observed:

  • When I restart my Query host and the denormalizer starts, it makes it a thousand or so events in before it stops. There are no application-side errors. This behavior is pretty consistent: if I restart the Query host, it does the same thing, processes a couple thousand and then stops. I was able to verify that those records were making it into Mongo.
  • I added logging to the client connection to let me know when it disconnects and why. The reason it gives is “Reconnection Limit Reached”
  • I looked at the log and it appears to begin with a socket receive error (attached ES log from the window where the service was restarted) followed by heartbeat timeouts.
  • There was 1 log in the *.err file that said “VERY SLOW QUEUE MSG” from about an hour prior. I’m not sure if that’s relevant
  • I increased the heartbeat timeout from 500 to 1000 with no change. I considered bumping it even higher, but at this point it seems that there’s a bigger problem in play. (A rough sketch of how those settings are applied is below the list.)
  • This works (mostly) without a problem in my dev environment.
  • The machine that is running ES in this case is a D2_V2 on Azure (2 cores, 7 GB memory).
  • I tested the query host on another machine and got the same result
I feel like this is network related, but I can’t figure out how, or how to test it. It’s able to process a couple thousand per restart, so I don’t think it’s a firewall issue. Any ideas?
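
For reference, the heartbeat values are the ones applied on the connection settings builder. This is only a rough sketch of that wiring, not our exact code; the interval value and hostname are placeholders:

ConnectionSettings settings = ConnectionSettings.Create()
    .SetHeartbeatInterval(TimeSpan.FromMilliseconds(750))   // placeholder; our real interval differs
    .SetHeartbeatTimeout(TimeSpan.FromMilliseconds(1000));  // bumped from 500 ms with no change

var connection = EventStoreConnection.Create(settings, new Uri("tcp://our-es-host:1113")); // placeholder host
connection.ConnectAsync().Wait();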

eventstore_timeouts.txt (22.7 KB)

Event Store 3.9.3.0

Can you turn on verbose logs on the client? That should give us more information about what is happening.
You can enable verbose logging on the connection settings:

ConnectionSettings settings = ConnectionSettings.Create().EnableVerboseLogging();

Also, can you give us a general idea of what your client is doing? Or provide a code sample if you are able.

Yeah, I can do that. Any idea how to set my implementation of ILogger so that I can capture those logs?

The client is just a CatchUpSubscription that subscribes to a category. The actual code is scattered across a few layers of abstraction, but this is the basic subscription setup:

public void CatchUpSubscription(
    string streamName,
    int lastSeenIndex,
    Action<EventInfo, int> processAction,
    int liveQueueSize = 10000,
    int readBatchSize = 500,
    bool resolveLinkTos = false)
{
    _processAction = processAction;

    // -2 is our sentinel for "no checkpoint yet"; null tells the catch-up
    // subscription to start from the beginning of the stream.
    int? startIndex = lastSeenIndex;
    if (lastSeenIndex == -2)
        startIndex = null;

    Logger.Verbose($"Subscribing to {streamName} as catch-up subscription at index {lastSeenIndex}...", this);

    // (maxLiveQueueSize, readBatchSize, verboseLogging, resolveLinkTos)
    var settings = new CatchUpSubscriptionSettings(liveQueueSize, readBatchSize, false, resolveLinkTos);
    var sub = _eventStore.SubscribeToStreamFrom(
        streamName,
        startIndex,
        settings,
        EventAppeared);

    // Stop the subscription when the host shuts down.
    _disposableActionCollection.Add(() => sub.Stop());
}
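
For context, it gets invoked roughly like this from the host. The stream name, handler, and checkpoint lookup are made up for illustration; the real wiring goes through our abstraction layers:

// Hypothetical wiring for illustration only.
var lastSeen = LoadLastSeenIndex();            // made-up helper; returns -2 when nothing has been processed yet
CatchUpSubscription(
    "$ce-Review",                              // category stream; the name is illustrative
    lastSeen,
    (evt, index) => _denormalizer.Handle(evt, index)); // made-up handler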

Oh, it’s part of the ConnectionSettings object. I’ll do that once I get a chance. Sprint ceremonies today… D:

Nevermind, I guess I can’t initialize one of those…
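
In case it helps anyone following along: if you can get at the builder, wiring in a custom logger to capture the verbose client logs looks roughly like this. The UseCustomLogger call and the ILogger method shapes are my assumptions about the ClientAPI, so treat it as a sketch rather than exact code:

// Sketch of a file-backed EventStore.ClientAPI.ILogger (interface shape assumed).
class FileLogger : EventStore.ClientAPI.ILogger
{
    private readonly string _path;
    public FileLogger(string path) { _path = path; }

    private void Write(string level, string format, object[] args, Exception ex = null) =>
        System.IO.File.AppendAllText(_path,
            $"{DateTime.UtcNow:o} [{level}] {string.Format(format, args)} {ex}{Environment.NewLine}");

    public void Error(string format, params object[] args) => Write("ERROR", format, args);
    public void Error(Exception ex, string format, params object[] args) => Write("ERROR", format, args, ex);
    public void Info(string format, params object[] args) => Write("INFO", format, args);
    public void Info(Exception ex, string format, params object[] args) => Write("INFO", format, args, ex);
    public void Debug(string format, params object[] args) => Write("DEBUG", format, args);
    public void Debug(Exception ex, string format, params object[] args) => Write("DEBUG", format, args, ex);
}

// Plugged into the builder along with verbose logging (path is a placeholder):
ConnectionSettings settings = ConnectionSettings.Create()
    .EnableVerboseLogging()
    .UseCustomLogger(new FileLogger(@"C:\logs\es-client.log"));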

Ok, attached are the logs from the server and from the client.

ES_Client_TimingOut.txt (4.95 KB)

ES_Log_Disconnecting.txt (27.9 KB)

Could you shed some light on what this environment looks like? The IP addresses used by the client and server don’t seem to match up.

For example, in the client logs you provided, it looks like the client is trying to connect to 13.65.44.141:1113 from 10.50.4.5:63629.

Close reason: [Success] “EventStoreConnection ‘Reviews-Service-Query’: closing TCP connection [13.65.44.141:1113, 10.50.4.5:63629, bd7e8c40-1253-46c3-8b61-18a2e25194ad] due to HEARTBEAT TIMEOUT at pkgNum 4.”

However, the server appears to be at 10.50.4.4:1113, and connections are coming from different IPs (I can’t find any that match up with the one the client is using).

[PID:06160:045 2017.06.21 17:38:35.266 INFO TcpService ] External TCP connection accepted: [Normal, 13.65.213.97:63627, L10.50.4.4:1113, {6c87d6c7-1c3a-43a0-b3ba-62b68dc3a914}].

[PID:06160:045 2017.06.21 17:38:35.704 INFO TcpService ] External TCP connection accepted: [Normal, 13.65.213.211:50680, L10.50.4.4:1113, {43ef18ca-24dd-4050-aec9-800cf5c408b7}].

I’m interested in the following:

  • Are you running your Event Store in a cluster?

  • How is your client connecting to the Event Store? Using Cluster settings/gossip or directly to an IP address?

  • Are your nodes potentially running in an environment where their internal IP addresses would differ from their external ones?

Absolutely:

  • Could you shed some light on what this environment looks like?

The server and client are on two different VMs. EventStore has a dedicated machine, which in this case is the IP with port 1113. The client runs inside a Windows service (NServiceBus host) on the other machine, which is a VM dedicated to these kinds of services. There are two instances of the client machine. At the moment, only one of them is running, but I have also been switching them around to see if the problem exists on both instances, and it does.

  • Are you running your Event Store in a cluster?

The EventStore is not currently running in a cluster. It’s a single VM behind a load balancer. We have a pending task to place it into a cluster but we’re holding off until we can figure the current problem out.

  • How is your client connecting to the Event Store? Using Cluster settings/gossip or directly to an IP address?

Right now, the EventStore VM has a public URL that the client uses to connect, so not directly to the IP address. The DNS is managed by Azure.

  • Are your nodes potentially running in an environment where their internal IP addresses would differ to their external ones?

Kinda. The internal IP addresses are dynamic, but the client connects via the public URL, which is constantly updated via DNS. I’m working with a Microsoft rep to try to determine whether or not this is some other sort of connectivity issue, and he pointed out that the VM is behind a load balancer but we’re connecting to the URL that is assigned to the VM itself. A miss on our part (oops), but it shouldn’t matter since the A record is updated when DNS is updated. I’m not sure that would account for the consistency of these errors, though.

Both client boxes behave the same way: they let about a thousand transmits through before connections start dropping often enough that the EventStore client closes the connection and never reopens it. We have a virtually identical setup in our dev environment with smaller VMs that does not exhibit this behavior, so we’re all stumped at the moment. The only difference in dev is no load balancer, but we’re not connecting through that at the moment anyway.
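
For completeness, the client side of that is just a plain single-node connection by URI (no ClusterSettings/gossip yet). The hostname below is a placeholder for the Azure DNS name; the connection name matches the one in the client logs:

ConnectionSettings settings = ConnectionSettings.Create().EnableVerboseLogging();
var connection = EventStoreConnection.Create(
    settings,
    new Uri("tcp://reviews-es.westus.cloudapp.azure.com:1113"), // placeholder for the VM's public DNS name
    "Reviews-Service-Query");
connection.ConnectAsync().Wait();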

Ok, so it seems that the logs were misleading? I had the readBatchSize set to 1000, which was apparently too much. I lowered it to 500 and everything started processing somewhat normally. I have an issue with MongoDB not keeping up, but that seems to be a different problem. So I guess I don’t understand a couple of things. Any insight would help:

For a CatchUpSubscription, what are maxLiveQueueSize and readBatchSize and how do they work? I assumed that readBatchSize was the number of events that would be pulled in at a time when catching up with the event store, and that maxLiveQueueSize was how large the queue could be when it’s running normally (i.e. after it’s done catching up). I’m not sure what the default values for either are. I was using 10,000 for the live queue and 1000 for the read batch (which was bad, so I lowered it to 500). It would help to have some sort of guide for understanding how to tune these settings.
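
For reference, this is the settings object in question, with my current understanding of the two knobs as comments. The descriptions are my reading of the behavior, not an official explanation, so corrections welcome:

// (maxLiveQueueSize, readBatchSize, verboseLogging, resolveLinkTos)
var settings = new CatchUpSubscriptionSettings(
    10000,  // maxLiveQueueSize: how many live (pushed) events the client will buffer while the
            // handler is still working through them; overflowing this drops the subscription
    500,    // readBatchSize: how many events are requested per read while catching up (was 1000, now 500)
    false,  // verboseLogging
    false); // resolveLinkTos (false in our current setup)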