Odd SubscribeToAllFrom behavior

When I get all events using:

```
var settings = new CatchUpSubscriptionSettings(200, 1, true, false);

var catchup = _eventStoreClient.Connection.SubscribeToAllFrom(
    null,
    settings,
    (sub, evt) => EventStoreOnEventCatchup(sub, evt),
    null,
    (sub, reason, ex) => EventStoreOnSubscriptionDroppedCatchup(sub, reason, ex),
    _eventStoreClient.Credentials);
```

… I seem to be receiving all events, all expected event types, just fine.

However, when I instead read slices (so I can operate on them asynchronously, in batched workloads against an external service) and then hand the last slice's Position to the subscription, like so:

```
var indexPosition = new Position(0, 0); // getIndexPosition();

ServiceTrace.Trace.Verbose(EventIds.ProcessingStart, () => "BeginIndexing: Subscribe to event store notifications");

AllEventsSlice currentSlice;
int slice = 1;
do
{
    currentSlice = _eventStoreClient.Connection.ReadAllEventsForwardAsync(indexPosition, _eventStoreConfig.EventStoreReadBatchSize, false, _eventStoreClient.Credentials).Result;
    ServiceTrace.Trace.Verbose(EventIds.ProcessingInfo, () => string.Format("Slice #: {0}", slice));
    indexPosition = currentSlice.NextPosition;

    try
    {
        var eventsToIndex = currentSlice.Events
            .Where(e => null != e.OriginalEvent &&
                        null != e.OriginalEvent.EventType &&
                        !_elasticClient.TypesToFilter.Contains(e.OriginalEvent.EventType));

        Task.Run(() => _elasticClient.IndexEvents(eventsToIndex.ToList()));
    }
    catch (AggregateException aex)
    {
        ServiceTrace.Trace.Error(EventIds.ProcessingError, () => string.Format("Aggregate exception while processing slice #{0}.", slice), aex);
    }
    catch (Exception ex)
    {
        ServiceTrace.Trace.Error(EventIds.ProcessingError, () => string.Format("Exception while processing slice #{0}.", slice), ex);
    }

    slice++;
} while (!currentSlice.IsEndOfStream);

ServiceTrace.Trace.Information(EventIds.ProcessingInfo, () => "Catchup complete. Subscribing to new events.");

var settings = new CatchUpSubscriptionSettings(200, 1, true, false);

var catchup = _eventStoreClient.Connection.SubscribeToAllFrom(
    indexPosition,
    settings,
    (sub, evt) => EventStoreOnEventCatchup(sub, evt),
    null,
    (sub, reason, ex) => EventStoreOnSubscriptionDroppedCatchup(sub, reason, ex),
    _eventStoreClient.Credentials);
```

… I seem to receive only a subset of event types. All of the event types show up in the slices, but once all the slices have been read, the subscription that follows surfaces only a limited set of event types. If I restart the service, the new events that were missed by the subscription now show up in the catch-up slices.

Why am I seeing this behavior? What am I missing?
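(One aside on my own loop above, which may or may not be related: since the Task.Run call is never awaited, the surrounding try/catch will not observe exceptions thrown from IndexEvents; they surface only on the background task. If I wanted those catch blocks to actually see indexing failures, one option would be to wait on the task. A minimal sketch using the same members as above:

```
// Sketch only: waiting on the task makes IndexEvents failures surface as an
// AggregateException, which the existing catch block would then actually see.
// It does serialize indexing with reading, which may not be what we want;
// collecting the tasks and awaiting them after the loop is another option.
var eventsToIndex = currentSlice.Events
    .Where(e => null != e.OriginalEvent &&
                null != e.OriginalEvent.EventType &&
                !_elasticClient.TypesToFilter.Contains(e.OriginalEvent.EventType))
    .ToList();

Task.Run(() => _elasticClient.IndexEvents(eventsToIndex)).Wait();
```
)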

Disregard, this was based on bad feedback from developers. I wrote a test harness to demonstrate it was not in fact the case.
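For what it's worth, the harness was essentially a subscription that tallies the event types it sees, so the live subscription's output could be compared against what the slice reads return. A minimal sketch (the dictionary and console output are illustrative; the connection and credentials fields are the same ones used above):

```
// Sketch only: subscribe to $all from the beginning and count event types as
// they arrive, skipping system events ("$..."). Requires
// using System.Collections.Concurrent;
var typeCounts = new ConcurrentDictionary<string, int>();
var settings = new CatchUpSubscriptionSettings(200, 1, true, false);

var sub = _eventStoreClient.Connection.SubscribeToAllFrom(
    null,                                   // null = start from the beginning of $all
    settings,
    (s, evt) =>
    {
        var type = evt.OriginalEvent != null ? evt.OriginalEvent.EventType : null;
        if (type != null && !type.StartsWith("$"))
            typeCounts.AddOrUpdate(type, 1, (_, n) => n + 1);
    },
    null,
    (s, reason, ex) => Console.WriteLine("Dropped: {0} {1}", reason, ex),
    _eventStoreClient.Credentials);

// ... let it run for a while, then dump the tallies:
foreach (var kv in typeCounts)
    Console.WriteLine("{0}: {1}", kv.Key, kv.Value);
```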

Or not. It looks like we eventually run into some kind of error after the second subscription has been running for a period of time. I believe this corresponds with the loss of event flow. With verbose logging on, this is from the master node:

ERR:

[PID:05172:011 2016.08.29 17:23:12.457 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 15140ms. Q: 0/10.
[PID:05172:011 2016.08.29 17:30:27.039 ERROR QueuedHandlerMRES ] ---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares - 15421ms. Q: 0/12.
[PID:05172:070 2016.08.29 21:02:15.215 INFO MiniWeb ] Error while replying from MiniWeb
System.Net.HttpListenerException (0x80004005): The I/O operation has been aborted because of either a thread exit or an application request
   at System.Net.HttpResponseStream.EndWrite(IAsyncResult asyncResult)
   at EventStore.Transport.Http.AsyncQueuedBufferWriter.EndWrite(IAsyncResult ar) in c:\projects\eventstore-commercialha\oss\src\EventStore.Transport.Http\AsyncQueuedBufferWriter.cs:line 116

cluster-node:

[PID:05172:032 2016.08.29 21:05:25.982 INFO TcpConnectionManager] Connection 'external-normal' [10.100.14.183:14567, {f2ee3e96-581c-469a-8098-58719087b47a}] closed: Success.
[PID:05172:071 2016.08.29 21:05:25.982 DEBUG PersistentSubscripti] Lost connection from 10.100.14.183:14567
[PID:05172:040 2016.08.29 21:05:46.873 INFO TcpService ] External TCP connection accepted: [Normal, 10.100.196.7:19326, L10.100.12.199:8080, {d07bc8f6-baeb-4bd6-8679-fa4236fe22f9}].
[PID:05172:072 2016.08.29 21:05:48.311 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #3]: ReadAllEventsForward - 359ms. Q: 0/3.
[PID:05172:071 2016.08.29 21:05:48.951 TRACE InMemoryBus ] SLOW BUS MSG [Worker #5 Bus]: TcpSend - 640ms. Handler: TcpSendService.
[PID:05172:071 2016.08.29 21:05:48.951 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [Worker #5]: TcpSend - 640ms. Q: 0/4.
[PID:05172:040 2016.08.29 21:05:50.936 DEBUG HttpSendService ] Error occurred while replying to HTTP with message EventStore.Core.Messages.GossipMessage+SendGossip: The specified network name is no longer available.
[PID:05172:069 2016.08.29 21:05:52.451 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #2]: ReadAllEventsForward - 2765ms. Q: 0/4.
[PID:05172:046 2016.08.29 21:05:52.983 INFO TcpConnectionMonitor] # 10.100.196.7:19326 58960kb pending send
[PID:05172:052 2016.08.29 21:05:53.264 TRACE InMemoryBus ] SLOW BUS MSG [Worker #5 Bus]: TcpSend - 812ms. Handler: TcpSendService.
[PID:05172:052 2016.08.29 21:05:53.264 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [Worker #5]: TcpSend - 812ms. Q: 0/11.
[PID:05172:075 2016.08.29 21:05:53.264 DEBUG HttpSendService ] Error occurred while replying to HTTP with message EventStore.Core.Messages.GossipMessage+SendGossip: The specified network name is no longer available.
[PID:05172:025 2016.08.29 21:05:54.701 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #1]: ReadAllEventsForward - 343ms. Q: 0/1.
[PID:05172:028 2016.08.29 21:05:55.014 TRACE InMemoryBus ] SLOW BUS MSG [Worker #5 Bus]: TcpSend - 312ms. Handler: TcpSendService.
[PID:05172:028 2016.08.29 21:05:55.014 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [Worker #5]: TcpSend - 312ms. Q: 0/1.
[PID:05172:025 2016.08.29 21:05:57.967 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [StorageReaderQueue #4]: ReadAllEventsForward - 1812ms. Q: 0/3.
[PID:05172:028 2016.08.29 21:05:58.327 TRACE InMemoryBus ] SLOW BUS MSG [Worker #5 Bus]: TcpSend - 359ms. Handler: TcpSendService.
[PID:05172:028 2016.08.29 21:05:58.327 TRACE QueuedHandlerThreadP] SLOW QUEUE MSG [Worker #5]: TcpSend - 359ms. Q: 0/3.

… and so on for the rest of the log.

What are you using for the node?

[PID:05172:011 2016.08.29 17:23:12.457 ERROR QueuedHandlerMRES ]
---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares -
15140ms. Q: 0/10.
[PID:05172:011 2016.08.29 17:30:27.039 ERROR QueuedHandlerMRES ]
---!!! VERY SLOW QUEUE MSG [StorageWriterQueue]: WritePrepares -
15421ms. Q: 0/12.
15 seconds to do a disk write?

This will definitely cause the node to stop writing events (they are written synchronously).

Note the rest look like networking errors.

Windows Server 2012 R2 x64 Hyper-V VM against a SAN (and we've seen issues against the SAN in other cases in the past). 8 GB RAM, 4 virtual processors.

When you say the rest look like networking errors, I’m curious about the network name no longer being available in particular. If all nodes and clients were apparently online, didn’t go offline, are still online now, where might this be coming from?

When we restart our client, the issue appears to go away for a while.

"When you say the rest look like networking errors, I'm curious about
the network name no longer being available in particular. If all nodes
and clients were apparently online, didn't go offline, are still
online now, where might this be coming from?"

This just means the socket was broken and a single http request failed.

While I await an infrastructure resource to help me with VM performance, I’ve continued monitoring and testing, and wanted to review/confirm some observations.

After a period of time, we seem to start losing at least some events. I'm not certain yet whether it is 100% of events or not. There is nothing in the master node's error log, but we do see SLOW QUEUE MSG entries at some point prior to this occurring. I'm not sure whether they start at exactly the same time as the event loss.

I can find an event (via Stream Browser) just written into Event Store that does not seem to be surfaced in the subscription. Does that make sense, given the other details discussed so far?

Oh, and if it matters, the subscription, at least during the present testing, stops at position 2719763074. It reports that it remains subscribed to all, and the client hasn't disconnected or closed, but the position doesn't change. Checking at 5-second intervals, this condition has held since the time event flow degraded.
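To be concrete about what I mean by checking: the event handler records the commit position of the last event it saw, and a timer compares that value every few seconds. A rough sketch (EventStoreOnEventCatchup is the handler wired up above; the fields and CheckSubscriptionLiveness are hypothetical; requires using System.Threading;):

```
// Sketch only: track the last commit position delivered to the catch-up
// handler and flag the subscription as stalled if it stops advancing.
private long _lastCommitPosition = -1;
private long _previousCheck = -1;

private void EventStoreOnEventCatchup(EventStoreCatchUpSubscription sub, ResolvedEvent evt)
{
    if (evt.OriginalPosition.HasValue)
        Interlocked.Exchange(ref _lastCommitPosition, evt.OriginalPosition.Value.CommitPosition);

    // ... normal indexing work ...
}

// Invoked every 5 seconds, e.g. from a System.Threading.Timer:
private void CheckSubscriptionLiveness(object state)
{
    var current = Interlocked.Read(ref _lastCommitPosition);
    if (current == _previousCheck)
        ServiceTrace.Trace.Information(EventIds.ProcessingInfo,
            () => string.Format("Subscription position appears stuck at {0}.", current));
    _previousCheck = current;
}
```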

Can I get confirmation as to whether or not these observations fit with the VERY SLOW QUEUE MSG entries? This is a major issue for us. Do I need to go ahead and open a ticket?

No, these do not fit. I would need to see far more information. What I can tell you is that if you have a subscription (to $all or to a stream) and you are having network problems, back pressure can build up and cause the subscription to be dropped. You should receive a subscription-dropped notification in this case.
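The drop, if it happens, arrives via the subscriptionDropped callback you are already passing to SubscribeToAllFrom (EventStoreOnSubscriptionDroppedCatchup in your code). A minimal sketch of handling it by logging the reason and resubscribing from the last position you processed (the checkpoint field and resubscribe helper are hypothetical):

```
// Sketch only: log why the catch-up subscription was dropped and, unless we
// stopped it ourselves, resubscribe from the last position we processed.
// _lastProcessedPosition and ResubscribeFrom(...) are hypothetical.
private void EventStoreOnSubscriptionDroppedCatchup(
    EventStoreCatchUpSubscription sub, SubscriptionDropReason reason, Exception ex)
{
    ServiceTrace.Trace.Error(EventIds.ProcessingError,
        () => string.Format("Subscription dropped: {0}", reason), ex);

    // ProcessingQueueOverflow (back pressure) and ConnectionClosed are the
    // typical reasons to see here when the network is struggling.
    if (reason != SubscriptionDropReason.UserInitiated)
        ResubscribeFrom(_lastProcessedPosition);
}
```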

Feel free to open a ticket.