System Projections intermittently failing

Hello.

Over the past couple of days we've had an issue with the system projections in EventStore 4.1.0.

We are streaming events to the event store, sometimes at quite a volume, with a large event every now and then. The reason the projections give for failing is that they cannot write their checkpoints: the writes time out and the retry limit of 5 is reached. Looking through the server logs, I can't really glean anything about why the writes are timing out, just that they are, and that this is what fails the projections.

As a second point, when we try to re-enable the projections after a failure, they occasionally get stuck in a Prepared/Initial state.

These are the errors that we get when the projections fail:

Failed to write events to $et-VehicleTaxonomySupplied. Retry limit of 5 reached. Reason: CommitTimeout. Checkpoint: C:60617520740/P:60617520740

Failed to write events to $et-LineImageStatsReported. Retry limit of 5 reached. Reason: CommitTimeout. Checkpoint: C:60056375316/P:60056375316.

After retrying 5 times, we failed to write the checkpoint for $by_event_type to $projections-$by_event_type-checkpoint due to a CommitTimeout

The taxonomy event is our big one (12 KB); there is a plan to split it up into smaller events in the future.

LineImageStatsReported is a very small event (just a couple of ints), and the last error is the projection failing to write its checkpoint.
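
For anyone wanting to line these failures up against write volume, here is a minimal Python sketch that pulls the stream name, reason and checkpoint out of the "Failed to write events" messages. The pattern is inferred only from the two lines quoted above, so it may not match other message variants. The names `WriteFailure` and `parse_write_failure` are made up for illustration, and reading `C:`/`P:` as commit and prepare positions is an assumption.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the two "Failed to write events" lines quoted in this
# thread; other EventStore versions may word the message differently.
WRITE_FAILURE = re.compile(
    r"Failed to write events to (?P<stream>\S+)\. "
    r"Retry limit of (?P<retries>\d+) reached\. "
    r"Reason: (?P<reason>\w+)\. "
    r"Checkpoint: C:(?P<commit>\d+)/P:(?P<prepare>\d+)"
)


class WriteFailure(NamedTuple):
    """Hypothetical container for one parsed failure message."""
    stream: str
    retries: int
    reason: str
    commit_position: int   # assumption: "C:" is the commit position
    prepare_position: int  # assumption: "P:" is the prepare position


def parse_write_failure(line: str) -> Optional[WriteFailure]:
    """Return the parsed failure, or None if the line does not match."""
    match = WRITE_FAILURE.search(line)
    if match is None:
        return None
    return WriteFailure(
        stream=match.group("stream"),
        retries=int(match.group("retries")),
        reason=match.group("reason"),
        commit_position=int(match.group("commit")),
        prepare_position=int(match.group("prepare")),
    )
```

Feeding each server log line through `parse_write_failure` and grouping the results by `stream` would show whether the timeouts cluster around the large taxonomy events or hit small events just as often.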

Any ideas?

We have a similar issue. Whenever we deploy (100+ microservices), we load EventStore heavily enough that the projections fail with commit timeouts and we have to restart them manually, like clockwork. And yeah, in rare cases they get “stuck”, which requires a master node restart (we avoid this as best we can by waiting until the environment stabilizes and the load on EventStore is minimal before restarting the projections). Sadly, the only solution I know of is to beef up your nodes, although I’ve seen suggestions for tweaking the projection settings.
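
The manual restart described above could probably be scripted. The following is only a rough Python sketch under several assumptions, not something taken from this thread: that the node's HTTP API is reachable at `http://localhost:2113`, that the default `admin`/`changeit` credentials are in use, that `GET /projections/all-non-transient` returns a JSON document with a `projections` list whose entries carry `name` and `status`, and that `POST /projection/{name}/command/enable` re-enables a projection. Please check each of these against the HTTP API documentation for your version. The function names are hypothetical.

```python
import base64
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:2113"  # assumption: default HTTP port
AUTH = ("admin", "changeit")        # assumption: default credentials


def faulted_projections(status_doc: dict) -> list:
    """Names of projections whose status mentions 'Faulted' (assumed wording)."""
    return [
        p["name"]
        for p in status_doc.get("projections", [])
        if "Faulted" in p.get("status", "")
    ]


def enable_url(base_url: str, name: str) -> str:
    """Build the (assumed) enable-command URL, escaping '$' in system names."""
    return f"{base_url}/projection/{urllib.parse.quote(name, safe='')}/command/enable"


def _request(url: str, method: str) -> bytes:
    """Send an authenticated request; POSTs carry an empty body."""
    token = base64.b64encode(f"{AUTH[0]}:{AUTH[1]}".encode()).decode()
    req = urllib.request.Request(
        url,
        data=b"" if method == "POST" else None,
        method=method,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()


def restart_faulted(base_url: str = BASE_URL, settle_seconds: int = 60) -> list:
    """Wait for load to settle, then re-enable every faulted projection."""
    time.sleep(settle_seconds)  # crude stand-in for "wait until the environment stabilizes"
    doc = json.loads(_request(f"{base_url}/projections/all-non-transient", "GET"))
    names = faulted_projections(doc)
    for name in names:
        _request(enable_url(base_url, name), "POST")
    return names
```

Running `restart_faulted()` as the last step of a deploy would replace the manual step, though it does nothing for projections that are genuinely stuck and need a node restart.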