We noticed today that the category projection stopped this weekend. The UI reported that the category projection was 100% done, but that stream was missing the last 100-200 events. After a restart of Event Store it caught up and now works as expected. Any ideas why this might have happened?
I had a similar occurrence except with the $by_event_type projection. Stopping and starting had no effect. It seemed to happen when memory usage went crazy. All was well on a restart.
For the UI, this is likely a rounding error at such small numbers.
I’m not sure what you mean, Greg. If a projection stops, that will most likely mean no updates in the UI whatsoever in our case. The way we have it set up is using a category for all our events, since we don’t have that many, and building up an Elasticsearch index based on that. If the last events are missing, I would say it is more than a rounding error, since those events belong to the aggregates the users are working on at the moment and need to be up to date. Also, there was nothing that indicated that the projection had stopped, except that the Elasticsearch index wasn’t updated.
I mean the 100% in the UI: given, say, 200 days of operations, what the UI shows you is probably a rounding error. What are the actual numbers? We can probably improve how the UI shows this. Consider 200 days @ 400 events/day and now it's 100 events behind: ((400*200)-100) / (400*200) is about 99.9%, which a rounded display would show as 100%.
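To make that arithmetic concrete, here is a small Python sketch. The figures (200 days, 400 events/day, 100 events behind) are the hypothetical ones from the example, and the assumption that the UI rounds to a whole percent is only a guess, not something confirmed in this thread.

```python
def progress_percent(total_events: int, events_behind: int) -> float:
    """Share of events already processed, as a percentage."""
    return (total_events - events_behind) / total_events * 100


total = 400 * 200                      # 80,000 events written over 200 days
pct = progress_percent(total, 100)     # projection is 100 events behind

print(f"{pct:.3f}%")   # value with three decimals
print(f"{pct:.0f}%")   # what a display rounding to whole percent would show
```

With these numbers the whole-percent figure reads 100 even though the projection is 100 events behind, so the percentage alone cannot distinguish "caught up" from "slightly behind".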
Greg
It wasn’t a problem with what the UI showed us; the projection had stopped, and I think that is a major difference. The projection was 200-300 events behind and it didn’t get updated when I did actions that resulted in events that should have been added to the projection. After a restart it all started working again as expected, but there was nothing telling us that the projection had stopped.
What was the state of the projection?
This actually happened again in a test environment. It seems to do so when there hasn’t been any activity for a couple of days. We didn’t see any errors in the logs or anything, just that the projection had stopped.
We’ve had this happen to us twice now. It’s a known issue (I corresponded with ES on the subject on Nov 25), and it’s pretty surprising to me that Greg was pretending that you are at fault in your observation of it.
The first time, I opened a ticket (we have commercial support) and was told that “it happens sometimes, you just need to shut down the master node.” Which is unsatisfying.
The second time was this morning.
It’s hard to describe the insidiousness of this failure mode. We run a category stream projection that we poll to ensure that our application state stays “Fresh”, but what winds up happening is that the projection just stops itself and returns, say, events 0-5119 out of the 6000 events in the stream. Then the application gets to event 5119, the category projection says that is “fresh”, and our code executes based on that. It’s a lie. But we can’t tell it’s a lie.
So basically the workaround for this is to stand up another piece of infrastructure that will:
- curl the EventStore endpoint (undocumented; we discovered it through the Chrome inspector, since the front-end uses it) to get the projection status:
curl -H "Accept: application/json" eventstore.local:2113/projections/all-non-transient
- Check that response to find out if the projections are all “Running” or “Running/Paused” (I have no idea what Running/Paused could mean, but know that they tend to stall out in a “Preparing” state). A “happy” response looks like this:
{
  "projections": [
    {
      "coreProcessingTime": 1557,
      "version": 1,
      "epoch": -1,
      "effectiveName": "$by_event_type",
      "writesInProgress": 0,
      "readsInProgress": 0,
      "partitionsCached": 1,
      "status": "Running",
      "stateReason": "",
      "name": "$by_event_type",
      "mode": "Continuous",
      "position": "C:4933870301/P:4933870301",
      "progress": 100.0,
      "lastCheckpoint": "C:4928276394/P:4928276394",
      "eventsProcessedAfterRestart": 95718,
      "statusUrl": "http://eventstore.local:2113/projection/$by_event_type",
      "stateUrl": "http://eventstore.local:2113/projection/$by_event_type/state",
      "resultUrl": "http://eventstore.local:2113/projection/$by_event_type/result",
      "queryUrl": "http://eventstore.local:2113/projection/$by_event_type/query%3Fconfig=yes",
      "enableCommandUrl": "http://eventstore.local:2113/projection/$by_event_type/command/enable",
      "disableCommandUrl": "http://eventstore.local:2113/projection/$by_event_type/command/disable",
      "checkpointStatus": "",
      "bufferedEvents": 0,
      "writePendingEventsBeforeCheckpoint": 0,
      "writePendingEventsAfterCheckpoint": 0
    },
    {
      "coreProcessingTime": 2383,
      "version": 1,
      "epoch": -1,
      "effectiveName": "$by_category",
      "writesInProgress": 0,
      "readsInProgress": 0,
      "partitionsCached": 1,
      "status": "Running",
      "stateReason": "",
      "name": "$by_category",
      "mode": "Continuous",
      "position": "C:4933870301/P:4933870301",
      "progress": 100.0,
      "lastCheckpoint": "C:4928276394/P:4928276394",
      "eventsProcessedAfterRestart": 95718,
      "statusUrl": "http://eventstore.local:2113/projection/$by_category",
      "stateUrl": "http://eventstore.local:2113/projection/$by_category/state",
      "resultUrl": "http://eventstore.local:2113/projection/$by_category/result",
      "queryUrl": "http://eventstore.local:2113/projection/$by_category/query%3Fconfig=yes",
      "enableCommandUrl": "http://eventstore.local:2113/projection/$by_category/command/enable",
      "disableCommandUrl": "http://eventstore.local:2113/projection/$by_category/command/disable",
      "checkpointStatus": "",
      "bufferedEvents": 0,
      "writePendingEventsBeforeCheckpoint": 0,
      "writePendingEventsAfterCheckpoint": 0
    },
    {
      "coreProcessingTime": 326,
      "version": 1,
      "epoch": -1,
      "effectiveName": "$streams",
      "writesInProgress": 0,
      "readsInProgress": 0,
      "partitionsCached": 1,
      "status": "Running",
      "stateReason": "",
      "name": "$streams",
      "mode": "Continuous",
      "position": "C:4933870301/P:4933870301",
      "progress": 100.0,
      "lastCheckpoint": "C:4928276394/P:4928276394",
      "eventsProcessedAfterRestart": 95718,
      "statusUrl": "http://eventstore.local:2113/projection/$streams",
      "stateUrl": "http://eventstore.local:2113/projection/$streams/state",
      "resultUrl": "http://eventstore.local:2113/projection/$streams/result",
      "queryUrl": "http://eventstore.local:2113/projection/$streams/query%3Fconfig=yes",
      "enableCommandUrl": "http://eventstore.local:2113/projection/$streams/command/enable",
      "disableCommandUrl": "http://eventstore.local:2113/projection/$streams/command/disable",
      "checkpointStatus": "",
      "bufferedEvents": 0,
      "writePendingEventsBeforeCheckpoint": 0,
      "writePendingEventsAfterCheckpoint": 0
    },
    {
      "coreProcessingTime": 269,
      "version": 1,
      "epoch": -1,
      "effectiveName": "$stream_by_category",
      "writesInProgress": 0,
      "readsInProgress": 0,
      "partitionsCached": 1,
      "status": "Running",
      "stateReason": "",
      "name": "$stream_by_category",
      "mode": "Continuous",
      "position": "C:4933870301/P:4933870301",
      "progress": 100.0,
      "lastCheckpoint": "C:4928276394/P:4928276394",
      "eventsProcessedAfterRestart": 95718,
      "statusUrl": "http://eventstore.local:2113/projection/$stream_by_category",
      "stateUrl": "http://eventstore.local:2113/projection/$stream_by_category/state",
      "resultUrl": "http://eventstore.local:2113/projection/$stream_by_category/result",
      "queryUrl": "http://eventstore.local:2113/projection/$stream_by_category/query%3Fconfig=yes",
      "enableCommandUrl": "http://eventstore.local:2113/projection/$stream_by_category/command/enable",
      "disableCommandUrl": "http://eventstore.local:2113/projection/$stream_by_category/command/disable",
      "checkpointStatus": "",
      "bufferedEvents": 0,
      "writePendingEventsBeforeCheckpoint": 0,
      "writePendingEventsAfterCheckpoint": 0
    }
  ]
}
- If the states do not contain “Running”, then restart the master node by sending this (undocumented) curl command:
curl -d "" -v admin:<password>@10.0.0.92:2113/admin/shutdown
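The check described in the steps above could be sketched in Python roughly as follows. This is only a sketch: the function names `fetch_status` and `stalled_projections` are made up for illustration, the URL is the one from the curl command above, and treating any status that contains "Running" as healthy is an assumption taken from this post rather than from documented EventStore behaviour.

```python
import json
import urllib.request

# Endpoint taken from the curl command above; adjust host/port for your cluster.
STATUS_URL = "http://eventstore.local:2113/projections/all-non-transient"


def fetch_status(url: str = STATUS_URL) -> dict:
    """Fetch the projection status document and return it as parsed JSON."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def stalled_projections(payload: dict) -> list[str]:
    """Return the names of projections whose status does not contain 'Running'.

    Assumption: "Running" and "Running/Paused" both count as healthy;
    anything else (for example "Preparing") counts as stalled.
    """
    return [
        projection.get("name", "<unnamed>")
        for projection in payload.get("projections", [])
        if "Running" not in projection.get("status", "")
    ]


# Example with a trimmed-down payload in the same shape as the response above.
sample = {
    "projections": [
        {"name": "$by_event_type", "status": "Running"},
        {"name": "$by_category", "status": "Preparing"},
    ]
}
print(stalled_projections(sample))  # ['$by_category']
```

A monitor could call `stalled_projections(fetch_status())` on a schedule and alert, or trigger the restart, whenever the list is non-empty. It only inspects the `status` field, so it will not notice a projection that has stopped advancing while still reporting "Running".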
I must say, it’s the worst feeling workaround I have ever had to do for a piece of commercial software.
Justin
Stopping is a different issue from always reading 99.9%; they are unrelated. The latter is a display issue. The freezing is a bug, for which the current workaround (for a non-production feature, remember) is to kill the master node and let another take over.
Seriously… are projections ever going to work? We have to make a decision as to whether and when to start investing in other technologies and an ingress pipeline if not.
We like the convenience of having these capabilities in a single product. I think of it as the ActiveRecord of streaming: optimized for developer convenience and appearance of consistency.
But if projections aren’t going to ship, then I think it would be an ethical and honorable thing to let it be known. And in light of the delays, it would be great to get regular updates from the dev team (or the dev team’s manager) as to progress, setbacks, and revised estimates.
If we build a pipeline out of available commodity open source tools, then we’ll probably end up opting for things that are more commonly known and broadly adopted. We have to start our work in learning and developing soon-ish so that we’re not caught short.
It would be super to have more insight into the management of EventStore development so that we can make better decisions.
Best,
Scott
Same thing happened to me yesterday (v 4.0.1.0). It’s the first time this has happened in the 4 months or so we’ve been in production. The UI reported started/paused, I think. Rebooting the master solved it.
Upgrading to 4.0.3 significantly helped with projection failures for us. Increasing your -CommitTimeoutMs also helps (and, if you’re as dumb as me, so does eventually realizing you need to switch to SSD instead of HD for your storage…).
Projection failures are linked to disk type?
Can be if they are getting write timeouts.