Projections are failing silently

Jerry_Verhoef · September 16, 2016, 2:06pm

We are using projections but sometimes they are failing without any clear reason. I need some help debugging this problem or investigating what is wrong.
Multiple projections are occasionally failing, usually only one or two at the time. After resetting the projection they are working just fine again.

When I retrieve status of the projection I get the following info:

{

“coreProcessingTime”: 2,

“version”: 1,

“epoch”: -1,

“effectiveName”: “frontend.authorization.roles_per_person.2”,

“writesInProgress”: 0,

“readsInProgress”: 0,

“partitionsCached”: 2,

“status”: “Running”,

“stateReason”: “”,

“name”: “frontend.authorization.roles_per_person.2”,

“mode”: “Continuous”,

“position”: “$ce-psybizz.neon.domainmodel.person: 15”,

“progress”: 100.0,

“lastCheckpoint”: “$ce-psybizz.neon.domainmodel.person: -1”,

“eventsProcessedAfterRestart”: 6,

“statusUrl”: “http:///projection/frontend.authorization.roles_per_person.2”,

“stateUrl”: “http:///projection/frontend.authorization.roles_per_person.2/state”,

“resultUrl”: “http:///projection/frontend.authorization.roles_per_person.2/result”,

“queryUrl”: “http:///projection/frontend.authorization.roles_per_person.2/query%3Fconfig=yes”,

“enableCommandUrl”: “http:///projection/frontend.authorization.roles_per_person.2/command/enable”,

“disableCommandUrl”: “http:///projection/frontend.authorization.roles_per_person.2/command/disable”,

“checkpointStatus”: “”,

“bufferedEvents”: 0,

“writePendingEventsBeforeCheckpoint”: 0,

“writePendingEventsAfterCheckpoint”: 0

}

``

When I try to retrieve the state or result of the projection it returns an empty result:

curl http:///projection/frontend.authorization.roles_per_person.2/state -I -XGET

HTTP/1.1 200 OK

Access-Control-Allow-Methods: GET, OPTIONS

Access-Control-Allow-Headers: Content-Type, X-Requested-With, X-Forwarded-Host, X-PINGOTHER, Authorization, ES-LongPoll, ES-ExpectedVersion, ES-EventId, ES-EventType, ES-RequiresMaster, ES-HardDelete, ES-ResolveLinkTo, ES-ExpectedVersion

Access-Control-Allow-Origin: *

Access-Control-Expose-Headers: Location, ES-Position

ES-Position: {"$s":{"$ce-psybizz.neon.domainmodel.person":15}}

Cache-Control: max-age=0, no-cache, must-revalidate

Vary: Accept

Content-Type: application/json; charset=utf-8

Server: Mono-HTTPAPI/1.0

Content-Length: 0

Keep-Alive: timeout=15,max=10

``

Code of this projection:

fromCategory(‘psybizz.neon.domainmodel.person’)

.foreachStream()

.when({

‘$init’: function() {

return [];

},

‘Psybizz\Neon\DomainModel\Person\RoleAssignedToPerson’: roleAssignedToPersonHandler

});

function roleAssignedToPersonHandler(state, event) {

state.push(event.data.roleDefinitionId);

return state;

}

``

Small note: This information is retrieved from my local dev stack which is ES3.5, but we have the same problem on our acceptance environment which we just upgraded to 3.9.1

pieter.germishuys · September 16, 2016, 3:48pm

Hi Jerry,
Thanks for reporting this, the best place to report these types of issues is on github so that we can track it appropriately as we are moving towards 4.0.0 (projections out of beta version) and we would like to resolve these types of issues.

You haven’t really attached any logs which might point to an issue such as checkpoint write failures etc etc…

Would you kindly please open an issue on the github repository for Event Store and attach some logs and we can help investigate from there?

Also, are the only symptoms you are seeing is that the state is not available?

Thank you once again

Jerry_Verhoef · September 23, 2016, 1:32pm

Hi Pieter,

Sorry for the late reply but I had wait for a case when it happend again. Since we cannot trigger it and it doesn’t happen on a regular interval (several times a week). Both the state as the results are completely empty. The log files itself are not reporting any error messages at all. (As far I can see)

I have created an Gist. The Gist contains three files: https://gist.github.com/JerryVerhoef/b3fafa6ec553ae5b10294cbafe665009

log file around the time of the failure
log file after the reset
the projection itself

I will create an issue out it on Github.

I do have some other questions which might me related to the issue (wild guessing atm):

Is it possible that another unrelated projection is causing the other projections to fail?
Is there a limit on the number of projections?
Is a projection a long running process?
In our projections we use some global helper variables so that we don’t need to pollute the state, this might break if the projection is not long running process. Is this allowed or are we making an incorrect assumption

Jerry

pieter.germishuys · September 23, 2016, 6:52pm

Hi Jerry,
No worries on the late reply.

Will review the gist but first let me answer your questions

Is it possible that another unrelated projection is causing the other projections to fail?

Depends entirely on what the failure reason is, a common one would be that one projection is attempting to write to a stream that another projection is writing to.

Is there a limit on the number of projections?

There is no hard limit on the number of projections

Is a projection a long running process?

In our projections we use some global helper variables so that we don’t need to pollute the state, this might break if the projection is not long running process. Is this allowed or are we making an incorrect assumption

Each projection runs in it’s own little VM and as such when the node running the projection dies we don’t take a snapshot of the state. Projections relies on checkpoints to start from where they left off. Put the variables you want in state rather than relying on the projections being long running processes.

Greg_Young1 · September 23, 2016, 7:33pm

Projections follows a functional model the functions in the pattern
match (eg FooEvent, $any, $init) should be pure.

Jerry_Verhoef · September 26, 2016, 3:54pm

Hi Jerry,
No worries on the late reply.

Will review the gist but first let me answer your questions

Is it possible that another unrelated projection is causing the other projections to fail?

Depends entirely on what the failure reason is, a common one would be that one projection is attempting to write to a stream that another projection is writing to.

Is there a limit on the number of projections?

There is no hard limit on the number of projections

Is a projection a long running process?

In our projections we use some global helper variables so that we don’t need to pollute the state, this might break if the projection is not long running process. Is this allowed or are we making an incorrect assumption

Each projection runs in it’s own little VM and as such when the node running the projection dies we don’t take a snapshot of the state. Projections relies on checkpoints to start from where they left off. Put the variables you want in state rather than relying on the projections being long running processes.

Ahhh this might cause some our problems. We will update all of our projections (they are under test so that shouldn’t be that difficult)

One more small question:

Is it allowed to have constants helper variables / private helper functions ? We have some variables for STREAM_CATEGORY and STREAM_NAMESPACE to prevent that we are writing to a stream that another projection is writing to.

I am quite confident that this will solve many of our problems

Kind regards,

Jerry

pieter.germishuys · September 26, 2016, 5:27pm

I am not sure I understand your question?
Can you possibly give me an example of how you are using those constants?

Jerry_Verhoef · September 26, 2016, 5:57pm

Sure.

The most common case is something like this.

pieter.germishuys · September 26, 2016, 5:59pm

Yes, that’s fine.

Jerry_Verhoef · September 27, 2016, 2:04pm

Ha Pieter,

We rebooted the cluster this morning, after close monitoring of all log files we noticed the following error message in the error log. I couldn’t correlate it to the earlier problem but it might be related:

[PID:12847:122 2016.09.27 11:26:19.534 ERROR QueuedHandlerAutoRes] Error while processing message EventStore.Core.Messages.ClientMessage+ReadStreamEventsBackwardCompleted in queued handler ‘Projection Core #0’.
System.Exception: StartObject or EndArray expected (At: 1, 2)
at EventStore.Projections.Core.Services.Processing.PartitionState.Error (Newtonsoft.Json.JsonTextReader reader, System.String message) [0x00000] in :0
at EventStore.Projections.Core.Services.Processing.PartitionState.Deserialize (System.String serializedState, EventStore.Projections.Core.Services.Processing.CheckpointTag causedBy) [0x00000] in :0
at EventStore.Projections.Core.Services.Processing.DefaultCheckpointManager.OnLoadPartitionStateReadStreamEventsBackwardCompleted (EventStore.Core.Messages.ReadStreamEventsBackwardCompleted message, EventStore.Projections.Core.Services.Processing.CheckpointTag requestedStateCheckpointTag, System.Action1 loadCompleted, System.String partitionStreamName, System.String stateEventType) [0x00000] in <filename unknown>:0 at EventStore.Projections.Core.Services.Processing.DefaultCheckpointManager+<BeginLoadPartitionStateAt>c__AnonStorey0.<>m__0 (EventStore.Core.Messages.ReadStreamEventsBackwardCompleted m) [0x00000] in <filename unknown>:0 at EventStore.Core.Messaging.RequestResponseDispatcher2[EventStore.Core.Messages.ClientMessage+ReadStreamEventsBackward,EventStore.Core.Messages.ClientMessage+ReadStreamEventsBackwardCompleted].Handle (EventStore.Core.Messages.ReadStreamEventsBackwardCompleted message) [0x00000] in :0
at EventStore.Core.Messaging.RequestResponseDispatcher2[EventStore.Core.Messages.ClientMessage+ReadStreamEventsBackward,EventStore.Core.Messages.ClientMessage+ReadStreamEventsBackwardCompleted].EventStore.Core.Bus.IHandle<TResponse>.Handle (EventStore.Core.Messages.ReadStreamEventsBackwardCompleted message) [0x00000] in <filename unknown>:0 at EventStore.Core.Bus.MessageHandler1[EventStore.Core.Messages.ClientMessage+ReadStreamEventsBackwardCompleted].TryHandle (EventStore.Core.Messaging.Message message) [0x00000] in :0
at EventStore.Core.Bus.InMemoryBus.Publish (EventStore.Core.Messaging.Message message) [0x00000] in :0
at EventStore.Core.Bus.InMemoryBus.Handle (EventStore.Core.Messaging.Message message) [0x00000] in :0
at EventStore.Core.Bus.QueuedHandlerAutoReset.ReadFromQueue (System.Object o) [0x00000] in :0

I grep’d a week worths of logs to see if the same error occurred and I noticed it happend before.