Event store projections failing on a regular basis

Hello everybody,

We have started using Event Store for one of our projects at the beginning of this year.

We rely extensively on the projections in order to limit the scope of work to be done by downstream activities.

Unfortunately, we have noticed that the projections engine fails regularly, and requires that we reboot the Event store.

We are currently using Event Store 4.0.0.

Did anybody face the same issue? After performing searches on Google nothing really came up in the results.

I am joining a print screen of our dashboard.

Best regards

Omar

Auto Generated Inline Image 1.png

We had a similar problem, but it turned out one of our custom projections was doing something pretty bad, causing a knock on effect, rather than the ES itself being the source of the problem.
Could you have a similar issue?

Hello Steven,

Our projections try to create a link to existing projections in order to avoid a bloated Event store.

I am joining a sample of our projections hoping you see something that we were not able to see ourselves.

The store behaves properly until all of a sudden the projections seems to fail…

Best regards

Omar

Sample of Queries

fromStreams(’$ce-{Component}’,
‘$ce-{Geometry}’)
.when(
{
ComponentCreatedEvent:function(state, event){
linkTo(ComponentModified, event);
return state;
},
GeometryCreatedEvent:function(state, event){
if(event.body.SignatureData !== null && event.body.SignatureData.Length !==0) {
linkTo(ComponentModified, event);
state.count += 1;}
return state;
},
GeometrySignatureExtractedEvent:function(state, event){
linkTo(ComponentModified, event);
return state;
},
GeometryExchangeFileExtractedEvent:function(state, event){
linkTo(ComponentModified, event);
return state;
},
GeometryGroupedEvent:function(state, event){
linkTo(ComponentModified, event);
return state;
},
}
);
fromCategory(‘GeometricalGroup’)
.foreachStream()
.when(
{
GeometricalGroupCreatedEvent:function(state, event){
linkTo(GroupsModified, event);
return state;
},
GeometricalGroupVisuAddedEvent:function(state, event){
linkTo(GroupsModified, event);
return state;
},
GeometryGroupedEvent:function(state, event){
linkTo(GroupsModified, event);
return state;
},
}
);

fromStreams(’$ce-{Supplier}’,
‘$ce-{Family}’,
‘$ce-{Component}’,
‘$ce-{Geometry}’)
.when(
{
SupplierCreatedEvent:function(state, event){
linkTo(SupAndFamModified}, event);
return state;
},
FamilyCreatedEvent:function(state, event){
linkTo(SupAndFamModified, event);
return state;
},
ComponentCreatedEvent:function(state, event){
linkTo(SupAndFamModified, event);
return state;
},
GeometryCreatedEvent:function(state, event){
if(event.body.BoundingBox !== null) {
linkTo(SupAndFamModified, event); }
return state;
},
GeometrySignatureExtractedEvent:function(state, event){
linkTo(SupAndFamModified, event);
return state;
},
}
);

Could your projection be writing to another Stream which was created outside this projection?

I am not sure I understand your ** properly.

Some of our projections create new streams. By they don’t write to streams they have not created themselves.

Omar
P.S.

When we create the projections, we use the following arguments
deleteStateStream= true
deleteCheckpointStream= true
deleteEmittedStreams= true

OK, that’s what I meant.
I was worried something else had created the stream, and this projection was trying to write to it.

It’s hard for me to guess what the real problem is, but perhaps you can disable some of the projections, and slowly start enabling them until the problem happens again?

Right now I have only one single machine (That is backed up every night)

It is used for our production environment, so stopping some projections and starting them one by one until the server fails does not guarantee that it will be reproducible, as we continue adding more data to the event store.

I am open to other suggestions though :o)

Omar

Sometimes when I have had a faulted projection, I can click the Projection and it informs me of an error that occurred.
Have you seen anything like that?

If the Projection has Faulted, and caused the By-Category to also fails, it strongly suggests it’s a Write to a Stream that ultimately causes the problem.

Sorry I can’t be much more help.

What is the reason for the failure? clicking on the projection in the UI will provide you with more details as to what the cause of the failure is.

Hey Steven

Thanks for your help.

So far. after looking at the error messages, we did not see any helpful message. I will be on the lookout.

I noticed in the logs some messages such as the message below.

The machine is a Windows machine, with an SSD hard drive. It has 8 Gigs of memory, Event Store occupies 1.8G.

I have a hunch that it is a timeout setting (for the projection computation) but don’t know how to change it, nor why it is happening.
Auto Generated Inline Image 1.pngAuto Generated Inline Image 2.pngAuto Generated Inline Image 3.png

[PID:03884:028 2017.05.28 22:50:21.380 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #4]: ReadStreamEventsForward - 7966ms. Q: 0/21.
[PID:03884:020 2017.05.28 22:51:18.514 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #1]: ReadStreamEventsForward - 12010ms. Q: 0/13.
[PID:03884:017 2017.05.28 22:52:00.970 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #3]: ReadStreamEventsForward - 12160ms. Q: 0/15.
[PID:03884:007 2017.05.28 22:52:52.474 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #3]: ReadStreamEventsForward - 12658ms. Q: 0/19.
[PID:03884:013 2017.05.28 22:53:36.567 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #4]: ReadStreamEventsForward - 11805ms. Q: 0/16.
[PID:03884:032 2017.05.28 22:54:20.106 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #4]: ReadStreamEventsForward - 9863ms. Q: 0/19.
[PID:03884:020 2017.05.28 22:54:59.468 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #3]: ReadStreamEventsForward - 9085ms. Q: 0/9.
[PID:03884:020 2017.05.28 22:55:41.536 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #4]: ReadStreamEventsForward - 8697ms. Q: 0/16.
[PID:03884:007 2017.05.28 22:56:19.605 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #4]: ReadStreamEventsForward - 8578ms. Q: 0/19.
[PID:03884:021 2017.05.28 22:57:00.713 ERROR QueuedHandlerThreadP] —!!! VERY SLOW QUEUE MSG [StorageReaderQueue #4]: ReadStreamEventsForward - 8620ms. Q: 0/12.

Hello Pieter,

Is it possible to retrieve such an error from the log files.

I have already restarted the event store service

Omar

You can just search for the text "faulted" in the logs.

Hello Pieter

Here are all of the faulted found through NotePad++ in the logs folder

Hope this helps

Omar

Search “faulted” (8 hits in 3 files)
C:\ProgramData\chocolatey\lib\eventstore-oss\tools\logs\2017-05-26\0.0.0.0-2113-cluster-node-err.log (1 hit)
Line 4524: [PID:03884:019 2017.05.26 21:20:25.649 ERROR ProjectionManager ] The ‘$by_category’ projection faulted due to ‘After retrying 5 times, we failed to write the checkpoint for $by_category to $projections-$by_category-checkpoint due to a CommitTimeout’
C:\ProgramData\chocolatey\lib\eventstore-oss\tools\logs\2017-05-26\0.0.0.0-2113-cluster-node.log (6 hits)
Line 536309: [PID:03884:027 2017.05.26 21:12:34.679 DEBUG ResponseWriter ] PROJECTIONS: Scheduling the writing of $faulted to $projections-$master. Current status of Writer: Busy: True
Line 540377: [PID:03884:027 2017.05.26 21:12:35.991 DEBUG ResponseWriter ] PROJECTIONS: Finished writing events to $projections-$master: $faulted
Line 697794: [PID:03884:019 2017.05.26 21:20:25.649 DEBUG ProjectionManagerRes] PROJECTIONS: Response received: 6770389@$faulted
Line 697795: [PID:03884:019 2017.05.26 21:20:25.649 ERROR ProjectionManager ] The ‘$by_category’ projection faulted due to ‘After retrying 5 times, we failed to write the checkpoint for $by_category to $projections-$by_category-checkpoint due to a CommitTimeout’
Line 968992: [PID:03884:025 2017.05.26 21:27:43.024 DEBUG ResponseWriter ] PROJECTIONS: Finished writing events to $projections-$master: $faulted
Line 968998: [PID:03884:018 2017.05.26 21:27:44.227 DEBUG ResponseWriter ] PROJECTIONS: Finished writing events to $projections-$master: $faulted
C:\ProgramData\chocolatey\lib\eventstore-oss\tools\logs\2017-05-29\0.0.0.0-2113-cluster-node-err.log (1 hit)
Line 7714: [PID:01692:020 2017.05.29 14:55:35.422 ERROR ProjectionManager ] The ‘proj-ComponentGeometriesProjection’ projection faulted due to ‘Failed to write an events to $$$projections-proj-ComponentGeometriesProjection-Geometry-136e0441-75c4-4bc6-aa8d-6127ae55c486-checkpoint. Retry limit of 5 reached. Reason: CommitTimeout’

Omar,

This was the issue we saw as well with the timeout.

It ended up being one of our projections, which was using “fromAll” so as our Event Store was growing, so was the amount of times this projection was being executed.

Do you happen to have “fromAll” being used?

Hello Steven

Here is a projection that fails even though it does not use fromAll. We do not use fromAll, we rely heavily on fromCatgory as illustrated below
fromCategory(‘Geometry’)
.foreachStream()
.when(
{
GeometryCreatedEvent:function(state, event){
linkTo(“Component-”+event.body.ComponentId+"-ChildrenGeometries", event);
return state;
},

            }
        );

Here is a print screen of a failed projection.

Should I just restart the service whenever this happens. Is it possible to raise the timeout used for computing the projections?

Omar
Auto Generated Inline Image 1.png

From the provided logs, it would seem that your projections are faulting for a valid reason which is basically it failing to write an event. There is one for the $by_category checkpoint failing to be written after 5 retries.

As for your question to Steven whether you can up the compilation timeout. From the logs you provided (feel free to drop all of your logs here in an archive for us to go through) increasing the compilation timeout won’t help in this case and we need to further diagnose why your projections are struggling to write events to Event Store.

Hello Pieter

Please find enclosed the log of the day where the projection faulted
Let me know if you need more
Omar

https://drive.google.com/open?id=0B5X5ITZq4xz3U1YyOXJyNmRWWEE

Hello All
The event store seems to have failed creating its projections after we restarted the service.
Auto Generated Inline Image 1.png

Here is the exception from the Event Store :
System.NotSupportedException: Specified method is not supported.
at EventStore.Projections.Core.Services.Management.ManagedProjection.StopUnlessPreparedOrLoaded() in c:\projects\eventstore\src\EventStore.Projections.Core\Services\Management\ManagedProjection.cs:line 882
at EventStore.Core.Bus.MessageHandler`1.TryHandle(Message message) in c:\projects\eventstore\src\EventStore.Core\Bus\MessageHandler.cs:line 33
at EventStore.Core.Bus.InMemoryBus.Publish(Message message) in c:\projects\eventstore\src\EventStore.Core\Bus\InMemoryBus.cs:line 324
at EventStore.Core.Bus.QueuedHandlerMresWithMpsc.ReadFromQueue(Object o) in c:\projects\eventstore\src\EventStore.Core\Bus\QueuedHandlerMRESWithMPSC.cs:line 134

More exceptions
[PID:03968:013 2017.05.31 15:15:52.331 ERROR ProjectionManager ] The ‘proj-GeometricalGroupGeometriesProjection’ projection faulted due to ‘Failed to write an events to $$$projections-proj-GeometricalGroupGeometriesProjection-Geometry-3634e7f1-4165-46af-bd34-377bb1ceb588-checkpoint. Retry limit of 5 reached. Reason: CommitTimeout’
[PID:03968:013 2017.05.31 15:15:52.331 ERROR ProjectionManager ] The ‘proj-SupplierComponentsProjection’ projection faulted due to ‘Failed to write an events to $$$projections-proj-SupplierComponentsProjection-Component-5ce7be66-aa6f-416f-9499-9ae81eadd2cd-checkpoint. Retry limit of 5 reached. Reason: CommitTimeout’
[PID:03968:013 2017.05.31 15:15:52.331 ERROR ProjectionManager ] The ‘proj-FamilyComponentsProjection’ projection faulted due to ‘Failed to write an events to Family-e6a5013a-2f95-4190-b1f1-414b93855a6c-ChildrenComponents. Retry limit of 5 reached. Reason: CommitTimeout’
[PID:03968:013 2017.05.31 15:15:53.455 ERROR ProjectionManager ] The ‘proj-ProjectionForExtractionSignature’ projection faulted due to ‘Failed to write an events to $$$projections-proj-ProjectionForExtractionSignature-Geometry-c04d500a-75ca-4a61-80db-767dca04bc7d-checkpoint. Retry limit of 5 reached. Reason: CommitTimeout’
[PID:03968:013 2017.05.31 15:15:53.455 ERROR ProjectionManager ] The ‘proj-ProjectionForFamiliesByGeometry’ projection faulted due to ‘Failed to write an events to $projections-proj-ProjectionForFamiliesByGeometry-GeometricalGroup-389f0913-7cd1-432f-8a4d-758be4f7960e-checkpoint. Retry limit of 5 reached. Reason: CommitTimeout’
[PID:03968:010 2017.05.31 18:01:33.046 FATAL GLOBAL-LOGGER ] Global Unhandled Exception occurred.
System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
at System.Collections.Generic.Dictionary2.ValueCollection.Enumerator.MoveNext() at System.Linq.Enumerable.Any[TSource](IEnumerable1 source, Func`2 predicate)
at EventStore.Projections.Core.Services.Processing.MultiStreamEventReader.Handle(ReadTimeout message) in c:\projects\eventstore\src\EventStore.Projections.Core\Services\Processing\MultiStreamEventReader.cs:line 173
at EventStore.Core.Messaging.SendToThisEnvelope.ReplyWith[T](T message) in c:\projects\eventstore\src\EventStore.Core\Messaging\SendToThisEnvelope.cs:line 20
at EventStore.Core.Services.TimerService.ThreadBasedScheduler.DoTiming() in c:\projects\eventstore\src\EventStore.Core\Services\TimerService\ThreadBasedScheduler.cs:line 68
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()

It seems that the category projection has issues being computed again.

Would you guys offer any alternative to using the projections ? A general idea as to the source of the problem would be fine too

Best regards

Omar Msaaf

0.0.0.0-2113-cluster-node-err.log (1.5 MB)

0.0.0.0-2113-cluster-node-err.log (1.15 MB)

Hi Omar,
Given the previous set of logs, it would appear your environment is not a good place… The previous logs are filled with commit timeouts and the projection eventually faulting due to the number of attempted retries to write those events are exhausted.