Just looking for ideas of threads to pull on:
This seems to only be happening in our production environment now that we're trying to deploy this new service. We have an existing service subscribed to EventStore that isn't having any issues, but I haven't tried a replay with it, so maybe the problem just isn't presenting yet for that service.
I processed about half a million events into EventStore without issue, but my denormalizer is choking on what appears to be a heartbeat timeout. Here's what I've observed:
- When I restart my Query host and the denormalizer starts, it gets a thousand or so events in and then stops, with no application-side errors. This behavior is pretty consistent: each restart of the Query host processes another couple thousand events and then stalls again, and I was able to verify that those records were making it into Mongo. (The subscription wiring is sketched near the end of the post.)
- I added logging to the client connection to let me know when it disconnects and why (the setup is in the first sketch below this list). The reason it gives is “Reconnection Limit Reached”.
- I looked at the ES log, and the trouble appears to begin with a socket receive error, followed by heartbeat timeouts (I've attached the ES log from the window when the service was restarted).
- There was one entry in the *.err file from about an hour prior that said “VERY SLOW QUEUE MSG”. I'm not sure if that's relevant.
- I increased the heartbeat timeout from 500 ms to 1000 ms with no change (the current values are in the sketch below this list). I considered bumping it even higher, but at this point it seems there's a bigger problem in play.
- This works (mostly) without a problem in my dev environment.
- The machine running ES in this case is a D2_v2 on Azure (2 cores, 7 GB RAM).
- I tested the Query host on another machine and got the same result.
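
Here's roughly how the connection is configured and where the disconnect logging hooks in. This is a trimmed-down sketch: I'm on the .NET TCP ClientAPI, the endpoint, port, and connection name below are placeholders, and the heartbeat values shown are the client-side settings.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using EventStore.ClientAPI;

static class QueryHostConnection
{
    public static async Task<IEventStoreConnection> ConnectAsync()
    {
        // Client-side heartbeat timeout, raised from 500 ms to 1000 ms; the interval is left at the default.
        ConnectionSettings settings = ConnectionSettings.Create()
            .SetHeartbeatTimeout(TimeSpan.FromMilliseconds(1000));
        // No KeepReconnecting()/LimitReconnectionsTo() here, so the default reconnection limit applies,
        // which is presumably what's behind the "Reconnection Limit Reached" close reason.

        var conn = EventStoreConnection.Create(
            settings,
            new IPEndPoint(IPAddress.Parse("10.0.0.4"), 1113), // placeholder address/port
            "query-host");

        // The logging that surfaced the disconnect reason.
        conn.Disconnected  += (_, e) => Console.WriteLine($"Disconnected from {e.RemoteEndPoint}");
        conn.Reconnecting  += (_, e) => Console.WriteLine("Reconnecting...");
        conn.ErrorOccurred += (_, e) => Console.WriteLine($"Connection error: {e.Exception}");
        conn.Closed        += (_, e) => Console.WriteLine($"Connection closed: {e.Reason}");

        await conn.ConnectAsync();
        return conn;
    }
}
```

I could add KeepReconnecting() so the client never gives up, but that feels like it would just mask whatever is killing the heartbeats in the first place.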
I feel like this is network-related, but I can't figure out how, or how to test for it. It's able to process a couple thousand events per restart, so I don't think it's a firewall issue. Any ideas?
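
And for completeness, the denormalizer's catch-up subscription is wired up roughly like this. Again a simplified sketch: the stream name, checkpoint handling, and the Mongo projection are placeholders, and the exact delegate signatures vary a bit between ClientAPI versions.

```csharp
using System;
using EventStore.ClientAPI;

static class Denormalizer
{
    public static void Start(IEventStoreConnection conn, long? lastCheckpoint)
    {
        conn.SubscribeToStreamFrom(
            "$ce-order",                    // placeholder stream name
            lastCheckpoint,                 // checkpoint saved alongside the Mongo documents
            CatchUpSubscriptionSettings.Default,
            (sub, resolved) =>
            {
                // Project the event into Mongo and persist the new checkpoint (omitted here).
                Console.WriteLine($"Handled {resolved.OriginalEventNumber}: {resolved.Event.EventType}");
            },
            sub => Console.WriteLine("Caught up; live processing started"),
            (sub, reason, ex) => Console.WriteLine($"Subscription dropped: {reason} {ex}"));
    }
}
```

Each restart re-creates this subscription from the last saved checkpoint, which is why every restart gets another batch of events into Mongo before stalling.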
eventstore_timeouts.txt (22.7 KB)