Has anyone experienced that the category projection stopped?

We noticed today that the category projection stopped over the weekend. The UI reported that the category projection was 100% done, but the stream was missing the last 100-200 events. After a restart of Event Store it caught up and has worked as expected since. Any ideas why this might have happened?

I had a similar occurrence, except with the $by_event_type projection. Stopping and starting it had no effect. It seemed to happen when memory usage went crazy. All was well after a restart.

For the UI, this is likely a rounding error at such small numbers.

I’m not sure what you mean, Greg. If a projection stops, that most likely means no updates in the UI whatsoever in our case. The way we have it set up is to use a category for all our events, since we don’t have that many, and to build up an Elasticsearch index based on that. If the last events are missing, I would say it is more than a rounding error, since those events belong to the aggregates the users are working on at the moment and need to be reflected promptly. Also, there was nothing indicating that the projection had stopped, except that the Elasticsearch index wasn’t being updated.

I mean the 100% in the UI: given, say, 200 days of operation, what the UI shows you is probably a rounding error. What are the actual numbers? We can probably improve how the UI shows this. Consider 200 days at 400 events/day, now 100 events behind: ((400 * 200) - 100) / (400 * 200).
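That works out to 79,900 / 80,000 = 0.99875, i.e. 99.875%, so a progress readout rounded to whole percent shows 100% even though the projection is 100 events behind.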

Greg

It wasn’t a problem with what the UI showed us: the projection had stopped, and I think that is a major difference. The projection was 200-300 events behind, and it didn’t get updated when I performed actions that produced events that should have been added to the projection. After a restart it all started working again as expected, but there was nothing telling us that the projection had stopped.

what was the state of the projection?

This actually happened again in a test environment; it seems to happen when there hasn’t been any activity for a couple of days. We didn’t see any errors in the logs or anything, just that the projection had stopped.

We’ve had this happen to us twice now. It’s a known issue (I corresponded with ES on the subject on Nov 25), and it’s pretty surprising to me that Greg was implying the fault lay in your observation of it.

The first time, I opened a ticket (we have commercial support) and was told that “it happens sometimes, you just need to shut down the master node.” Which is unsatisfying.

The second time was this morning.

It’s hard to describe the insidiousness of this failure mode. We run a category stream projection that we poll to ensure our application state stays “fresh”, but what winds up happening is that the projection just stops itself and returns events, say, 0-5119 out of the 6000 events in the stream. Then the application gets to event 5119, the category projection says that is “fresh”, and our code executes based on that. It’s a lie. But we can’t tell it’s a lie.

So basically the workaround for this is to stand up another piece of infrastructure that will:

  1. curl the EventStore endpoint to get the projection status (undocumented; we discovered it through the Chrome inspector on the front-end the UI uses):
curl -H "Accept: application/json"  eventstore.local:2113/projections/all-non-transient

  2. Check that response to find out whether the projections are all “Running” or “Running/Paused” (I have no idea what Running/Paused means, but I know that they tend to stall out in a “Preparing” state). A “happy” response looks like this:
{
  "projections": [
    {
      "coreProcessingTime": 1557,
      "version": 1,
      "epoch": -1,
      "effectiveName": "$by_event_type",
      "writesInProgress": 0,
      "readsInProgress": 0,
      "partitionsCached": 1,
      "status": "Running",
      "stateReason": "",
      "name": "$by_event_type",
      "mode": "Continuous",
      "position": "C:4933870301/P:4933870301",
      "progress": 100.0,
      "lastCheckpoint": "C:4928276394/P:4928276394",
      "eventsProcessedAfterRestart": 95718,
      "statusUrl": "http://eventstore.local:2113/projection/$by_event_type",
      "stateUrl": "http://eventstore.local:2113/projection/$by_event_type/state",
      "resultUrl": "http://eventstore.local:2113/projection/$by_event_type/result",
      "queryUrl": "http://eventstore.local:2113/projection/$by_event_type/query%3Fconfig=yes",
      "enableCommandUrl": "http://eventstore.local:2113/projection/$by_event_type/command/enable",
      "disableCommandUrl": "http://eventstore.local:2113/projection/$by_event_type/command/disable",
      "checkpointStatus": "",
      "bufferedEvents": 0,
      "writePendingEventsBeforeCheckpoint": 0,
      "writePendingEventsAfterCheckpoint": 0
    },
    {
      "coreProcessingTime": 2383,
      "version": 1,
      "epoch": -1,
      "effectiveName": "$by_category",
      "writesInProgress": 0,
      "readsInProgress": 0,
      "partitionsCached": 1,
      "status": "Running",
      "stateReason": "",
      "name": "$by_category",
      "mode": "Continuous",
      "position": "C:4933870301/P:4933870301",
      "progress": 100.0,
      "lastCheckpoint": "C:4928276394/P:4928276394",
      "eventsProcessedAfterRestart": 95718,
      "statusUrl": "http://eventstore.local:2113/projection/$by_category",
      "stateUrl": "http://eventstore.local:2113/projection/$by_category/state",
      "resultUrl": "http://eventstore.local:2113/projection/$by_category/result",
      "queryUrl": "http://eventstore.local:2113/projection/$by_category/query%3Fconfig=yes",
      "enableCommandUrl": "http://eventstore.local:2113/projection/$by_category/command/enable",
      "disableCommandUrl": "http://eventstore.local:2113/projection/$by_category/command/disable",
      "checkpointStatus": "",
      "bufferedEvents": 0,
      "writePendingEventsBeforeCheckpoint": 0,
      "writePendingEventsAfterCheckpoint": 0
    },
    {
      "coreProcessingTime": 326,
      "version": 1,
      "epoch": -1,
      "effectiveName": "$streams",
      "writesInProgress": 0,
      "readsInProgress": 0,
      "partitionsCached": 1,
      "status": "Running",
      "stateReason": "",
      "name": "$streams",
      "mode": "Continuous",
      "position": "C:4933870301/P:4933870301",
      "progress": 100.0,
      "lastCheckpoint": "C:4928276394/P:4928276394",
      "eventsProcessedAfterRestart": 95718,
      "statusUrl": "http://eventstore.local:2113/projection/$streams",
      "stateUrl": "http://eventstore.local:2113/projection/$streams/state",
      "resultUrl": "http://eventstore.local:2113/projection/$streams/result",
      "queryUrl": "http://eventstore.local:2113/projection/$streams/query%3Fconfig=yes",
      "enableCommandUrl": "http://eventstore.local:2113/projection/$streams/command/enable",
      "disableCommandUrl": "http://eventstore.local:2113/projection/$streams/command/disable",
      "checkpointStatus": "",
      "bufferedEvents": 0,
      "writePendingEventsBeforeCheckpoint": 0,
      "writePendingEventsAfterCheckpoint": 0
    },
    {
      "coreProcessingTime": 269,
      "version": 1,
      "epoch": -1,
      "effectiveName": "$stream_by_category",
      "writesInProgress": 0,
      "readsInProgress": 0,
      "partitionsCached": 1,
      "status": "Running",
      "stateReason": "",
      "name": "$stream_by_category",
      "mode": "Continuous",
      "position": "C:4933870301/P:4933870301",
      "progress": 100.0,
      "lastCheckpoint": "C:4928276394/P:4928276394",
      "eventsProcessedAfterRestart": 95718,
      "statusUrl": "http://eventstore.local:2113/projection/$stream_by_category",
      "stateUrl": "http://eventstore.local:2113/projection/$stream_by_category/state",
      "resultUrl": "http://eventstore.local:2113/projection/$stream_by_category/result",
      "queryUrl": "http://eventstore.local:2113/projection/$stream_by_category/query%3Fconfig=yes",
      "enableCommandUrl": "http://eventstore.local:2113/projection/$stream_by_category/command/enable",
      "disableCommandUrl": "http://eventstore.local:2113/projection/$stream_by_category/command/disable",
      "checkpointStatus": "",
      "bufferedEvents": 0,
      "writePendingEventsBeforeCheckpoint": 0,
      "writePendingEventsAfterCheckpoint": 0
    }
  ]
}
  3. If the statuses are not all “Running”, restart the master node by sending this (undocumented) curl command (a sketch automating all three steps follows below):

curl -d "" -v admin:<password>@10.0.0.92:2113/admin/shutdown

I must say, it’s the worst-feeling workaround I have ever had to put in place for a piece of commercial software.

Justin

Stopping is a different issue from always reading 99.9%; they are unrelated. The percentage is a display issue. The freezing is a bug, and the current workaround (for a non-production feature, remember) is to kill the master node and let another node take over.

Seriously… are projections ever going to work? We have to make a decision as to whether and when to start investing in other technologies and an ingress pipeline if not.

We like the convenience of having these capabilities in a single product. I think of it as the ActiveRecord of streaming: optimized for developer convenience and appearance of consistency.

But if projections aren’t going to ship, then I think it would be an ethical and honorable thing to let it be known. And in light of the delays, it would be great to get regular updates from the dev team (or the dev team’s manager) as to progress, setbacks, and revised estimates.

If we build a pipeline out of available commodity open source tools, then we’ll probably end up opting for things that are more commonly known and broadly adopted. We have to start our work in learning and developing soon-ish so that we’re not caught short.

It would be super to have more insight into the management of EventStore development so that we can make better decisions.

Best,

Scott

Same thing happened to me yesterday (v 4.0.1.0), the first time in the 4 months or so we’ve been in production. The UI reported started/paused, I think. Rebooting the master solved it.

Upgrading to 4.0.3 significantly helped with projection failures for us. Increasing your -CommitTimeoutMs also helps (and, if you’re dumb enough like me, eventually realizing you need to switch to SSD instead of HDD for your storage…).
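If you set that through the node’s config file rather than the command line, it’s a one-liner along these lines (the 10000 is purely illustrative, tune it for your hardware):

# eventstore.conf
CommitTimeoutMs: 10000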

projection failures are linked to disk type?

Can be if they are getting write timeouts.