Questions for those using EventStore in production?

I’m thinking about using EventStore, as it seems to be a good fit for some projects I’m working on. It will need to be reliable, so I’m hoping to get some honest opinions from others who currently use, or formerly used, EventStore in production. How did it work out? Was there a support contract, and if so, did you feel it was necessary to keep the lights on?

We can probably do the commercial support contract, but I have some concerns about using EventStore in a high-reliability scenario, in light of an issue I reported here (the additional context elaborates on the possibility of deeper systemic issues): https://github.com/EventStore/documentation/issues/374

I really want to use EventStore; I’m a big fan of .NET and functional programming, and overall I like the architecture. I just want an objective picture of what difficulties we can expect if our team accepts my recommendation and commits to it. Thanks!

The company I work for is in the fintech industry, so reliability and auditability are paramount.

In my experience EventStore has been rock solid for loading and storing aggregates as event streams. I don’t have the event count on hand, but we’re up in the hundreds of millions of events I believe, and accelerating, and I don’t think we’ve ever had an issue (although I’d need to confirm with colleagues).
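For a sense of what that load/store pattern looks like, here is a minimal sketch using the v5-era EventStore.ClientAPI TCP client; the stream name, event type, and credentials are placeholders, not our real setup:

```csharp
using System;
using System.Text;
using System.Threading.Tasks;
using EventStore.ClientAPI;

class AggregatePersistenceSketch
{
    static async Task Main()
    {
        // Connect to a single local node (cluster connection strings work too).
        var conn = EventStoreConnection.Create(new Uri("tcp://admin:changeit@localhost:1113"));
        await conn.ConnectAsync();

        // Append with an explicit expected version for optimistic concurrency:
        // the write fails with WrongExpectedVersionException if someone else
        // appended to the stream since we last read it.
        var payload = Encoding.UTF8.GetBytes("{\"amount\": 42}");
        var evt = new EventData(Guid.NewGuid(), "FundsDeposited", isJson: true, data: payload, metadata: null);
        await conn.AppendToStreamAsync("account-123", ExpectedVersion.NoStream, evt);

        // Rehydrate the aggregate by reading its stream forward.
        var slice = await conn.ReadStreamEventsForwardAsync("account-123", start: 0, count: 4096, resolveLinkTos: false);
        foreach (var resolved in slice.Events)
            Console.WriteLine(resolved.Event.EventType);
    }
}
```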

The UI has a bug with Queries, but once you know how to avoid it, it’s workable. We’ve only just started to use Projections (which are excitingly powerful) and have seen them stop twice in pre-production when we updated them. The first time I think was a bug; the second seemed to come down to poor VM specs. But projections going down isn’t a major issue: it doesn’t affect the rest of ES, and we just start them up again and they’re fine. We’re going live with them soon, and we’re just about to go live with a three-node cluster, which is the recommended setup.
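Starting a stopped projection back up is scriptable, too. Something like this sketch with the client’s ProjectionsManager should do it, assuming the default HTTP port and a hypothetical projection name:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using EventStore.ClientAPI.Common.Log;
using EventStore.ClientAPI.Projections;
using EventStore.ClientAPI.SystemData;

class ProjectionRestartSketch
{
    static async Task Main()
    {
        var creds = new UserCredentials("admin", "changeit");
        var manager = new ProjectionsManager(
            new ConsoleLogger(),
            new IPEndPoint(IPAddress.Loopback, 2113), // the node's external HTTP port
            TimeSpan.FromSeconds(5));

        // If a projection has stopped or faulted, check its status and re-enable it.
        var status = await manager.GetStatusAsync("my-projection", creds);
        Console.WriteLine(status);
        await manager.EnableAsync("my-projection", creds);
    }
}
```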

I’ve had limited experience with Kafka before and found it critically buggy on Windows. It wasn’t too bad to set up vanilla Kafka (not as straightforward as ES, but not bad), but after a couple of days it would hang with file locks (apparently a known bug they’ve had for years but haven’t fixed). A Linux container wasn’t an option at the time.

All the best with your decision making.

Jason,

Thanks for the response! Can you tell me anything about the Query UI bug? Is there an issue filed for it on GitHub somewhere?

Mark

Once you run a query, then make some changes to it and run it again, it invariably errors out, because the UI hasn’t fully updated the query text. If you copy the text and reload the page (which is what I do), you can see the text is missing the last few edits; I just paste it back in and run again, and it’s fine.

I haven’t filed a bug for it. I’m running 5.0.0 locally so there’s a chance it’s been fixed?

We have been running ES in production for a couple of years and have around 44M events, afaict.

Before we went into production we spent a fair amount of effort building out observability and alerting, backups, and load testing of our ES cluster. Our experience has therefore been pretty good: not without incident, but without any incidents that caused us significant problems to triage or resolve. Using cheaper instance types with a less performant network did cause problems, which we picked up through load testing. We focused on a few key operational stories: the time for a new node to stand up, restore from backup, and join the cluster (around 12 minutes, I think); alerting on the number of master/slave nodes falling outside parameters; alerting on ES fatal logs; and so on.
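For a flavour of the node-count alerting, our checks boil down to something like this sketch, polling one node’s /gossip endpoint (the member/state field names are what we see on a v5-era cluster; wire the alert into your own tooling rather than Console.WriteLine):

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class GossipCheckSketch
{
    // Polls one node's gossip endpoint and alerts if the number of live
    // master/slave members falls outside what a healthy 3-node cluster shows.
    static async Task Main()
    {
        using var http = new HttpClient();
        var json = await http.GetStringAsync("http://localhost:2113/gossip?format=json");
        using var doc = JsonDocument.Parse(json);

        var members = doc.RootElement.GetProperty("members").EnumerateArray().ToList();
        int masters = members.Count(m => m.GetProperty("isAlive").GetBoolean()
                                      && m.GetProperty("state").GetString() == "Master");
        int slaves  = members.Count(m => m.GetProperty("isAlive").GetBoolean()
                                      && m.GetProperty("state").GetString() == "Slave");

        if (masters != 1 || slaves < 2)
            Console.WriteLine($"ALERT: masters={masters}, slaves={slaves}");
    }
}
```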

Master elections have been not infrequent, and occasionally we see multiple elections in a row, presumably due to network degradation/partitions, until a master is finally agreed upon by the whole set of nodes.

Good monitoring is key to building confidence (which we now have) in ES cluster state management. And in the few incidents where the cluster actually fell over or a node couldn’t start up properly, it was clear from our drills how to recover: generally, shoot the bad node and trust our auto-recovery to bring the cluster back to stability, which it invariably did. Expect to see things like bad chunks on restore, resulting in a node thrashing while it tries to verify its state.

Other teams within the same company who installed ES and neglected to build out a proper operations underpinning have had more, and more serious, problems. To be fair, some of their ES clusters did just keep going and going after an install-and-walk-away… but when they ran out of disk space, that was a bit of a hard stop. :stuck_out_tongue: Other teams have had to invest a lot of effort building out more robust operational support for their ES clusters once they hit perf issues in prod that were hard to resolve. So I’d say if you want a good ES operational story, you’re going to have to pay for it one way or another; I much prefer paying in advance of any problems. :smiley:

I should also say: it’s harder than it should be to get a straight answer on exactly what (and how) one should monitor, what alert thresholds are reasonable, what performance should be expected from a tuned ES cluster, etc. The best thing we did was to get a helpful GES employee to spend an hour reviewing my document on how we should build out our operational story.

It would be fantastic if GetEventStore could put out a templated, standard approach ("you should monitor this and alert on that" guidance, this is how you load test, and so on) to help bootstrap the efforts of, well, everyone who isn’t already running a well-managed ES cluster in prod. :slight_smile:

I’ll see if there’s a way to share what we’ve done to operationalise ES; there’s a fair amount of useful (should-be-open-source) IP to share.

I’ve heard talk from some in my company that ES is not good enough, but that’s rubbish. You do have to spend effort up front if you want to enjoy it (I’m part of the on-call roster), but if you do, it’s fairly smooth sailing. :slight_smile:

I’ve been using EventStore in production for five years now. While I’ve never experienced severe issues with the database, I do stumble upon quite annoying ones from time to time.

Like everybody else, I recommend spending time on setting up the right monitoring and alerting, and regularly testing that you can load your backups.

I never experienced performance issues. I once had a client pushing 5,000 to 10,000 events/sec with no problem; not a big load, but not that small either. I never used the C# driver: I maintain one of my own in Haskell, which is what my clients use too.

My chief complaints about the database concern persistent subscriptions. They do work; however, on a very few occasions I experienced behaviour I didn’t expect.

For example, the server would stop sending events to clients connected to a persistent subscription (shown in red in the UI). It was never blocking, because I could restart the persistent subscription by updating its settings (without changing anything).
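In client code, that "update without changing anything" kick looks roughly like the sketch below. I use my own Haskell client, so this C# version against the TCP ClientAPI is only illustrative, with placeholder stream/group names:

```csharp
using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;
using EventStore.ClientAPI.SystemData;

class KickStuckSubscriptionSketch
{
    static async Task Main()
    {
        var conn = EventStoreConnection.Create(new Uri("tcp://admin:changeit@localhost:1113"));
        await conn.ConnectAsync();

        // Re-applying the existing settings restarts the subscription server-side.
        // Note (per the reply further down): any update discards the subscription's
        // current state by design, so consumers must tolerate redeliveries.
        PersistentSubscriptionSettings settings =
            PersistentSubscriptionSettings.Create().ResolveLinkTos();
        await conn.UpdatePersistentSubscriptionAsync(
            "account-123", "my-group", settings, new UserCredentials("admin", "changeit"));
    }
}
```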

I’ve never used the commercial version of the database, though my current employer is considering commercial support as we speak. I hope the commercial version surfaces more (debugging) information in the UI: for example, more straightforward access to parked messages, and better documentation on how persistent subscription checkpoints work. The web UI is not helpful enough for an admin, IMHO.

I never lost any data at any point, and the database was never responsible for breaking my applications in production. The largest EventStore instance I worked on was about 3 TB; again, not crazy big, but not small either.

For some reason, two sentences of my previous reply render bigger than the rest of the text. That wasn’t on purpose; it’s not me trying to emphasise those points in particular :-]

If you want, hit me up directly about the persistent subscription issues you had; I likely know that code better than most. Details/logs/etc. can help.

re: updating settings, this should *always* work. Any update to settings blows away the current state of the persistent subscription (this is by design!). I actually tried to work out how to *not* completely blow things away, since quite a few people run with large buffers, but it did not seem fruitful when balanced against the complexity. As such, they are literally just replaced.

re parked messages: when you say more "straightforward access to parked messages", what do you mean? They are just in a stream. The stream they are in is included in the subscription definition (and also follows a pattern that is not likely to change anytime soon). What do you mean by more straightforward access, beyond being able to read the stream?
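Reading it is just an ordinary stream read. For example, a sketch like this (placeholder stream/group names, using the parked-stream name pattern quoted later in this thread):

```csharp
using System;
using System.Text;
using System.Threading.Tasks;
using EventStore.ClientAPI;
using EventStore.ClientAPI.SystemData;

class ReadParkedSketch
{
    static async Task Main()
    {
        var conn = EventStoreConnection.Create(new Uri("tcp://localhost:1113"));
        await conn.ConnectAsync();

        // Parked messages live in an ordinary (system) stream, so admin
        // credentials are needed to read it.
        const string parked = "$persistentsubscription-account-123::my-group-parked";
        var slice = await conn.ReadStreamEventsForwardAsync(
            parked, 0, 100, resolveLinkTos: true, new UserCredentials("admin", "changeit"));

        foreach (var e in slice.Events)
            Console.WriteLine($"{e.Event.EventType}: {Encoding.UTF8.GetString(e.Event.Data)}");
    }
}
```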

We have opened an issue to track this here: https://github.com/EventStore/EventStore/issues/1983

> If you want, hit me up directly about the persistent subscription issues you had; I likely know that code better than most. Details/logs/etc. can help.

Thanks for tuning in!

The problem is that I never got a chance to capture logs when that persistent subscription issue happened. We might not understand how persistent subscriptions work, we might have messed up the subscription settings, or there might be a bug in the subscriptions themselves; I think the first two scenarios are the likely explanation. After a brief exchange with you on Twitter a while back, we switched to the Pinned strategy, which reduced the likelihood of this behaviour even further. We’ve only experienced the issue a couple of times this year.

> re: updating settings, this should *always* work. Any update to settings blows away the current state of the persistent subscription (this is by design!). I actually tried to work out how to *not* completely blow things away, since quite a few people run with large buffers, but it did not seem fruitful when balanced against the complexity. As such, they are literally just replaced.

I concur that each time we had an issue with persistent subscriptions, updating the subscription settings solved everything.

We are not afraid of replaying some events, because everything we do with those events is idempotent.
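To make that concrete, the idempotency we rely on amounts to something like this sketch; in production the set of processed IDs would live in the same transactional store as the side effects, not in memory:

```csharp
using System;
using System.Collections.Generic;

// A sketch of the idempotency idea: remember the IDs of events already handled,
// so a replayed or redelivered event becomes a harmless no-op.
class IdempotentHandlerSketch
{
    private readonly HashSet<Guid> _processed = new HashSet<Guid>();

    public void Handle(Guid eventId, Action sideEffect)
    {
        if (!_processed.Add(eventId))
            return; // seen before: a redelivery or replay, so do nothing

        sideEffect();
    }

    static void Main()
    {
        var handler = new IdempotentHandlerSketch();
        var id = Guid.NewGuid();
        handler.Handle(id, () => Console.WriteLine("applied once"));
        handler.Handle(id, () => Console.WriteLine("never printed"));
    }
}
```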

> re parked messages: when you say more "straightforward access to parked messages", what do you mean? They are just in a stream. The stream they are in is included in the subscription definition (and also follows a pattern that is not likely to change anytime soon). What do you mean by more straightforward access, beyond being able to read the stream?

I’m aware of the $persistentsubscription-{streamId}::{groupId}-parked stream. I’d just prefer, for example, a link next to “Replay Parked Messages” that points directly to the parked event stream.
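For what it’s worth, the “Replay Parked Messages” button appears to drive an HTTP action you can also call yourself; something like this sketch should do the same thing (assuming a /subscriptions/{stream}/{group}/replayParked endpoint on the node’s HTTP port, with placeholder stream/group names):

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class ReplayParkedSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();
        var basic = Convert.ToBase64String(Encoding.ASCII.GetBytes("admin:changeit"));
        http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", basic);

        // POST with an empty body; the server replays the parked messages.
        var resp = await http.PostAsync(
            "http://localhost:2113/subscriptions/account-123/my-group/replayParked", content: null);
        Console.WriteLine(resp.StatusCode);
    }
}
```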

I also find the few persistent subscription stats in the UI a bit confusing. Sometimes I have more parked messages than the web UI shows (by a long shot). The web UI could also display more information related to persistent subscriptions, if any is available.