Having a hard time tracking down ObjectDisposed exception

Denny_Crane · October 8, 2017, 5:51pm

I’ve been struggling with this for a while now. I have a service that will run just fine while I’m testing it, but when left to run on a schedule, it will always end up throwing an ObjectDisposed exception in my service bus logs and I can’t quite figure why.

MesCannot access a disposed object.Object name: ‘Reviews-Service-Command’.

Reviews-Service-Command is the connectionName to the EventStore connection.

The EventStore connection is a DI’d as a singleton, and there is nowhere in the code where it is ever explicitly told to close or dispose, so the only think I can think of is that it is closing and disposing itself. I added logging for Connected, Disconnected, Reconnecting, Closed, and ErrorOcurred events but the only log that I can find is from the Closed event. The reason states that it is because the max reconnect limit has been exceeded, but I can’t find any other logs that even say it was being disconnected in the first place (not that it matters because that event doesn’t seem to offer a reason).

So I’m looking for some threads to pull on

Other Info:

Hosted on Azure on a Linux VM
Heartbeat Timeout set to 1000ms
Verbose logging is currently not on because this service is relatively high volume and would fill up my log files too quick to find anything useful. I would expect failing conditions that result in a disconnect to be output on the Error channel – I’m going to see if I can’t
In the meanwhile, I’m going to attempt capture a verbose log, but the last time I did this, I only found heartbeat timeouts which seem to happen pretty often even when ever thing is working … very confusing.

Denny_Crane · October 9, 2017, 1:38am

Ok I captured a log of the problem happening but I can’t see any errors. It just stops working, and retries start piling up until they go exceeded and then the connection drops. I don’t understand. The log is about 500MB though so I’m not sure what to do…

Denny_Crane · October 9, 2017, 1:39am

I’ve seen something similar to this that happens with the MongodDB driver during denormalization in the query model, but that has been resolved. The problem there was keep alives. Is there something like this that I need to configure for EventStore?

Denny_Crane · October 9, 2017, 2:07am

I think I finally found the part where it starts to fail. It’s not an error though, just a debug log statement:

DASAPWEB02 [DBG 2017-10-09 00:11:44.399 +00:00] [Thread 38] EventStoreConnection ‘Reviews-Service-Command’: operation never got response from server.

This seems to be the beginning of the end. From this point on, the retries get scheduled and eventually exceed the limit and the connection drops. What would cause this???

Hayley-Jean_Campbell · October 9, 2017, 7:26am

Hi Denny,

Did the server have any corresponding logs to the log that said there was no response from the server?

What operation was it that didn’t get a response?

Is this a clustered environment?

If you would like, you can attach the verbose client logs here or email them to hayley at geteventstore.com and we can look through them for anything that might be causing this.

Denny_Crane · October 9, 2017, 5:53pm

Did the server have any corresponding logs to the log that said there was no response from the server?
– I’ll work on getting those. The last time I looked in there, the only thing I ever saw was a heartbeat timeout but those were all over the place, even in non-error scenarios.
What operation was it that didn’t get a response?

– It appears to be ReadStreamEventsForwardOperation

Is this a clustered environment?

– Yes, there are 3 nodes

Log file: https://drive.google.com/open?id=0B39U6ch48uVAcVdmMzRmVU5kUW8
It is a 10M 7z file that decompresses to 500MB. It is encrypted. I will email you the password.

Denny_Crane · October 9, 2017, 6:06pm

Here are the server logs for each node. Same password.

https://drive.google.com/open?id=0B39U6ch48uVANndvV1ZMVW80RUE

I see some errors in there but I don’t know what they mean, if there anything to be concerned about, or what to do about them.

Thanks!

Denny_Crane · October 9, 2017, 6:14pm

You’re probably going to ask about this error:

“The ‘SurveyAccountToClient’ projection faulted due to 'The given checkpoint is invalid. Possible causes might include having written an event to the projection’s managed stream.”

That was a projection that was written a long time ago to essentially rename the stream. I’m not sure if there was a goof up back then or if this would bust all streams going forward. Where I’m seeing the connection drop error is in a service that does not read from this stream above. That being said, I don’t know what it means by “the projection’s managed stream”. The projection basically copies events from ‘SurveyAccount’ to a stream called ‘SurveyClient’. Going forward, all new entities get written to ‘SurveyClient’. Is that a problem? Do you think this is even related?

Hayley-Jean_Campbell · October 10, 2017, 7:41am

Hi Denny,

Can you give us an idea of how your client executes reads?
Does it wait for a response to come back before executing the next one?
Does it perform any other actions? Subscriptions, writes, etc.
How large is the stream you are trying to read from? How large on average are the events in the stream?

The reason I am asking is because there appears to be a flood of the same read being sent to the server.

As you noted previously, there are a large number of errors stating “Operation never got response from the server”.

However, these operations aren’t the same operation being retried over and over, but a large number of different read requests all being retried at the same time.

You can tell this by the fact that the correlation ids are different, and the retry count is not incrementing.

Another thing to note is that the requests have a MaxCount of 1,000. Trying to read that many events could be causing some issues depending on the size of the events.

UTC now: 00:13:18.323, operation: Operation ReadStreamEventsForwardOperation (1e76ebb6-5ad4-4fc1-bd76-276369dd0938): Stream: XXXXXXX, FromEventNumber: 0, MaxCount: 1000, ResolveLinkTos: True, RequireMaster: True, retry count: 8, created: 00:09:33.990, last updated: 00:13:10.619.
DASAPWEB02 [DBG 2017-10-09 00:13:18.323 +00:00] [Thread 80] EventStoreConnection ‘Reviews-Service-Command’: operation never got response from server.
UTC now: 00:13:18.323, operation: Operation ReadStreamEventsForwardOperation (350dda6b-6927-4171-82d7-2d4f1ed7941f): Stream: XXXXXXX, FromEventNumber: 0, MaxCount: 1000, ResolveLinkTos: True, RequireMaster: True, retry count: 8, created: 00:09:33.990, last updated: 00:13:10.619.
DASAPWEB02 [DBG 2017-10-09 00:13:18.323 +00:00] [Thread 80] EventStoreConnection ‘Reviews-Service-Command’: operation never got response from server.
UTC now: 00:13:18.323, operation: Operation ReadStreamEventsForwardOperation (105d27be-f10a-453b-8851-1ab76e75cf21): Stream: XXXXXXX, FromEventNumber: 0, MaxCount: 1000, ResolveLinkTos: True, RequireMaster: True, retry count: 8, created: 00:09:33.990, last updated: 00:13:10.619.
DASAPWEB02 [DBG 2017-10-09 00:13:18.323 +00:00] [Thread 80] EventStoreConnection ‘Reviews-Service-Command’: operation never got response from server.

``

With regards to the projection - it’s unlikely that your projection failure is causing this issue. However, you can troubleshoot that as well if you’d like.

The first thing to check is whether someone has written to one of the projection’s internal streams.

An example would be the checkpoint stream i.e $projections-{projection_name}-checkpoint

Denny_Crane · October 10, 2017, 4:29pm

Can you give us an idea of how your client executes reads?

In this particular model (command), it simply does a lookup by ID. One for every save request (each message that comes in through the message bus could lead to 0 or more save requests). The number of save requests varies throughout the day; it could be 30, it could be 2500.

Does it wait for a response to come back before executing the next one?

It does not wait. Everything is executed asynchronously and requests come in via the message bus.

Does it perform any other actions? Subscriptions, writes, etc.

This particular model (command) will always look up a stream before attempting to write to it. If the stream exists, no write is performed; if the stream does not exist, then it will write it.
Currently, there are 2 instances of this service to help process a large queue. The queue gets up around 8k messages 6 times a day.
On the Query side (different process), there is a subscription that pulls these events in order to denormalize them into a database.

How large is the stream you are trying to read from? How large on average are the events in the stream?

The individual stream is not very big. Usually, it only contains 1 or 2 events; usually just 1 which is ReviewCreated. The size of the message are no more than 2KB

"With regards to the projection - it’s unlikely that your projection failure is causing this issue. However, you can troubleshoot that as well if you’d like.

The first thing to check is whether someone has written to one of the projection’s internal streams.

An example would be the checkpoint stream i.e $projections-{projection_name}-checkpoint"

If that’s the case, I’ll hold off on that for now. One thing at a time

Thanks!

Hayley-Jean_Campbell · October 11, 2017, 8:40am

Hi Denny,

What version client are you using? And what version of Event Store?

From the information you’ve given me, it seems like your client may be flooding the Event Store with too many requests at once.

The client does not queue operations internally, but rather sends them off to the server as soon as it is able.

Would you be able to try queuing up the reads and writes rather than executing them immediately?

As a side note, if you are only reading to determine if a stream exists, you could perform a write with an expected version of -1 instead of reading every time.

If the stream does not exist, the write will succeed, otherwise the write will fail.

Hayley-Jean_Campbell · October 11, 2017, 9:15am

My apologies, I was mistaken regarding the internal queue in the client.
The client does queue requests internally, and there are settings available to control how many items in the queue can be in flight at the same time.

Could you please try lowering the MaxConcurrentItems setting when building your connection? By default it’s set to 5,000.

Denny_Crane · October 11, 2017, 4:35pm

Client Version: 4.0.3
Server Version: 4.0.1.0

What would you suggest I set the MaxConcurrenItems to? Is there a formula? It seems strange that it wouldn’t be able to keep up when there are 3 nodes running in D2 Linux VMs in Azure. I mean, I can see that slowing down the reads so reducing those to mostly just writes could help… I’ll try lowering it to 2500 and monitor it. It’s a bit problematic right now because it causes other services to start dropping connection as well and I can only recreate this in production which makes it a bit risky. So if I set an expected version of -1, will it ignore it if it already exists, or will it throw? I’ll have to update my abstraction to use it.

Here are some other notes:

This problem doesn’t exhibit itself in our dev environment, however our dev environment is only running 1 ES node instead of 3.
The problem also doesn’t exhibit itself on first run. For example if I reset the service and let it do a full run, it’ll complete with no problems. It’s subsequent runs that begin to fail.
I have encountered a similar issue with MongoDB where connection timeouts were fighting each other. Azure starts disposing inactive connections after a few minutes, but the definition of “inactive connection” seems to be shaky as connections would begin to timeout in the middle of a long running operation after about 10 minutes. The way we were able to fix it was to increase the default timeout times in Azure and enable keep-alive on the VMs along with a keep alive duration. We also had to enable a socket level keep alive that was available in the mongo c# client. I’m wondering if maybe this could be a similar issue that I’m seeing with EventStore?? Is there a keep-alive on by default in the EventStore.ClientAPI? Are there any other thoughts along these lines?

Hayley-Jean_Campbell · October 13, 2017, 2:02pm

Hi Denny,

Apologies for taking so long to respond.

If you write with expected version -1 and the stream already exists, the write will throw a WrongExpectedVersion exception.

How are you trying to replicate this issue in dev?

Running 3 nodes vs 1 node does make a difference. For one, the more nodes you have, the more nodes a write needs to be replicated to before a write is considered successful.

There is also more happening over the network, such as gossips and replication.

Are you in a position to upgrade your Event Store?

In 4.0.2 and 4.0.3 there have been some improvements that aim to prevent clients from putting too much pressure on the server.

Part of this improvement was to drop requests that were sent from connections that are now disconnected, where previously Event Store would continue processing those requests.

Upgrading may alleviate some of the pressure on your Event Store.

Denny_Crane · October 13, 2017, 2:38pm

We can probably upgrade in a month or two. Did you give any thought to my comment on the Azure connection timeouts? This behavior feels very indicative of the connection disposal issue we had with Mongo. As I said before, we ultimately had to enable a socket level keep alive on the mongo client to solve it.

Greg_Young1 · October 13, 2017, 2:46pm

Event Store already uses protocol level keepalives on connections, this is likely unrelated.

Denny_Crane · October 22, 2017, 12:56am

ok so we just installed 4.0.3 on a google compute node, single node and the problem is still occurring. It happens after a few minutes of running, so I don’t think it’s a replication lag issue. I’m going to try limiting the concurrent operations to 2500

Denny_Crane · October 22, 2017, 1:04am

That doesn’t seem to help either. I’m still getting a slew of errors stating that the client never got a response from the server.

Denny_Crane · October 22, 2017, 1:11am

and we can see the server log saying that the connection to the client was lost, but I’m watching the client still running through retries and then eventually throwing tons the retries limit exceed exceptions.

Hayley-Jean_Campbell · October 23, 2017, 6:03am

Hi Denny,

Could you please send the latest client and server logs to me if you can?