SubscribeToAllFrom() suddenly only works for first 500 events

Morten_Brix_Pedersen · April 15, 2014, 7:15am

We have been running an EventStore for the last 6 months, and now we need to re-generate our read models from the event streams.

This operation - which we have tried many times before, suddenly does not work. Only the first 500 events are read, and then we are seeing heartbeat timeouts.

Sample code can be seen here: https://gist.github.com/mbp/a8d7625ba280b10016e0

We are using this build: 3.0.0.0 (master/8ef23af341976d4f92580cdae28921bd0e325393, Wed, 15 Jan 2014 13:31:15 +0000)

I can only reproduce it on our production database. On my local database, there are no problems.

The output of the client application is as follows:

[14,07:13:21.127,DEBUG] Connected to [172.31.50.209:1113, L192.168.1.102:62196,{8155c3df-8b48-4456-8fe4-a3f0069b07ba}].

OnConnected

15-04-2014 07:13:22 EventAppeared 0

15-04-2014 07:13:22 EventAppeared 1

[manually trimmed output here]

15-04-2014 07:13:22 EventAppeared 498

15-04-2014 07:13:22 EventAppeared 499

[11,07:13:25.512,INFO] EventStoreConnection ‘ES-aee8292d-7860-4009-a181-37ce2e69e61b’: closing TCP connection [172.31.50.209:1113, L192.168.1.102:62196, {8155c3df-8b48-4456-8fe4-a3f0069b07ba}] due to HEARTBEAT TIMEOUT at pkgNum 2.

[11,07:13:25.515,INFO] [07:13:25.515: N172.31.50.209:1113, L192.168.1.102:62196, {8155c3df-8b48-4456-8fe4-a3f0069b07ba}]:

Received bytes: 4018530, Sent bytes: 133

Send calls: 4, callbacks: 4

Receive calls: 809, callbacks: 808

Close reason: [Success] EventStoreConnection ‘ES-aee8292d-7860-4009-a181-37ce2e69e61b’: closing TCP connection [172.31.50.209:1113, L192.168.1.102:62196, {8155c3df-8b48-4456-8fe4-a3f0069b07ba}] due to HEARTBEAT TIMEOUT at pkgNum 2.

[11,07:13:25.518,DEBUG] Connection [172.31.50.209:1113, L192.168.1.102:62196, {8155c3df-8b48-4456-8fe4-a3f0069b07ba}] was closed cleanly.

OnDisconnected

OnReconnecting

[17,07:13:25.758,DEBUG] Connected to [172.31.50.209:1113, L192.168.1.102:62197,{166adb6e-408c-4719-b063-9e4fefb6e94d}].

OnConnected

Greg_Young1 · April 15, 2014, 7:55am

What is said in the server logs? Can you enable verbose logging in the client as well?

Morten_Brix_Pedersen · April 15, 2014, 9:34am

Hi Greg,

You can see client (verbose) output and server output here:

https://gist.github.com/mbp/7af1181c503ecb095cc4

I tried copying the entire database from the server to my local machine. The problem does not occur here.

I have a theory it might have to do with our server on Amazon which uses a different timezone than my local machine. But I suppose this should not be a problem.

jen20 · April 15, 2014, 10:12am

The logs show your connection being closed after the first 500 events are received over the subscription due to a heartbeat timeout:

[04,09:28:56.218,INFO] EventStoreConnection 'ES-a126c06e-01c6-400a-b13f-c24ca02b5193': closing TCP connection [172.31.50.209:1113, L192.168.1.102:52360, {da752d75-ab91-4392-91df-8faa6b93d6a2}] due to HEARTBEAT TIMEOUT at pkgNum 2.

[04,09:28:56.218,DEBUG] EventStoreConnection 'ES-a126c06e-01c6-400a-b13f-c24ca02b5193': CloseTcpConnection.

[04,09:28:56.221,INFO] [09:28:56.221: N172.31.50.209:1113, L192.168.1.102:52360, {da752d75-ab91-4392-91df-8faa6b93d6a2}]:

Received bytes: 4741980, Sent bytes: 133

Send calls: 4, callbacks: 4

Receive calls: 978, callbacks: 977

The reconnections then fail for the same reason.

Is this a a particularly small VM on Amazon? What is your heartbeat timeout configured as?

James

Greg_Young1 · April 15, 2014, 10:16am

That should not be a problem what could be a problem is you not reading the events fast enough over the tcp pipe from amazon. What happens with a smaller batch size?

Morten_Brix_Pedersen · April 15, 2014, 10:26am

Hi James,

Aha. I am using the default heart beat timeout (750 milliseconds). Increasing this value fixes the issue!

Yes it’s the smallest instance on Amazon, so I suppose this is what causes it (a test system, running for 6 months). We will be using a much greater instance when we go into production.

It would be nice if the heartbeat timeout “reason” was actually passed on to “OnDisconnected” handler, or something similar.