Any recommendations on how to speed up building of views from a large set of events?

We’re about to migrate some data, not that much if you ask me, from a SQL database to Event Store. To do so we have a small project that extracts the relevant data from the SQL database and inserts it into Event Store. The first part takes about 15 seconds to complete; as I said, not that much data.

We have also implemented some services that subscribe to events and build up views as events are written. That works fine under normal load, but now that we have an initial larger set of events it takes forever. The views we build are one in a SQL database and one in an Elasticsearch database. I’m pretty sure it is because we are only consuming one event at a time, and that causes many reads and writes against the view databases, since each event results in one lookup and then one write/update.

Is there a smarter way to do this? There must be. If possible I would like to have the same solution for the migration part as well as for the live running part; the views should be updated the same way regardless of whether an event results from the migration or from the actual system.

The solution is: don’t log things for each event… doh! We had some debug logging that we forgot to turn off for the migration.

That’s often an excellent perf optimization technique - I’ve seen it applied to great effect on numerous occasions too :wink:

Another one that’s underrated is to have your read model be entirely in memory and built from the ground up each time. Obviously this technique has its limits, but it works as long as:

a) the state you are keeping is sufficiently compact as to fit in a reasonable portion of your memory

b) the number of events you have (and their size over the wire), or your ability to filter them (e.g. if you have a natural boundary such as a trading day and can hence segregate or filter the events on the way), is a good match for how soon you need the in-memory state to be fully “up to date” / “consistent”

Once you have in-memory state, you can also do snapshotting (incorporating the “last seen event”), either as a single blob or split in some way that’s appropriate for how your reading is done (e.g. per tenant).
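For illustration, a minimal C# sketch of what a snapshottable in-memory model might look like - every name here is made up for the example, none of it is Event Store API:

```csharp
// Illustrative only: in-memory state plus the number of the last event
// folded into it. Persisting the two together as a snapshot lets a
// restart load the blob and resume the subscription from
// LastSeenEventNumber instead of replaying from scratch.
using System.Collections.Generic;

public class InMemoryReadModel
{
    public long LastSeenEventNumber { get; private set; } = -1;

    // Trivial example state: project names keyed by id.
    public Dictionary<string, string> ProjectNamesById { get; } =
        new Dictionary<string, string>();

    public void Apply(long eventNumber, string projectId, string name)
    {
        ProjectNamesById[projectId] = name;
        LastSeenEventNumber = eventNumber; // snapshot this together with the state
    }
}
```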

I saw a very good writeup expanding on the above recently but can’t for the life of me remember where - if I recall, I’ll post here.

–Ruben

Glad I’m not the only one :).

I’ve actually thought about having the read model in memory for the part we are still using SQL for, but haven’t done it. Since we want Elasticsearch for search, it’s natural not to have that part in memory :).

Please post the write-up if you find it!

I asked a similar question here: https://groups.google.com/forum/#!topic/event-store/aXy9iVObqRY

Basically, batch events in a buffer and flush based on batch size or a timeout. It would be nice if the subscription handled this for you. Elasticsearch has a nice bulk API that will greatly speed things up. With SQL, it depends on how you are accessing it, but most ORMs have batching, or you can roll your own hand-coded SQL to batch.
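A minimal sketch of that buffering idea in C# - the batcher itself is hand-rolled for the example and not part of any client library; the flush delegate is where an Elasticsearch bulk request or a multi-row SQL statement would go:

```csharp
// Accumulate events and flush either when the batch is full or when a
// timer fires, whichever comes first. Assumes events arrive from a
// single subscription thread; the lock only guards against the timer.
using System;
using System.Collections.Generic;
using System.Threading;

public class EventBatcher<TEvent> : IDisposable
{
    private readonly List<TEvent> _buffer = new List<TEvent>();
    private readonly object _lock = new object();
    private readonly int _maxBatchSize;
    private readonly Action<IReadOnlyList<TEvent>> _flush;
    private readonly Timer _timer;

    public EventBatcher(int maxBatchSize, TimeSpan flushInterval,
                        Action<IReadOnlyList<TEvent>> flush)
    {
        _maxBatchSize = maxBatchSize;
        _flush = flush;
        _timer = new Timer(_ => Flush(), null, flushInterval, flushInterval);
    }

    public void Add(TEvent e)
    {
        List<TEvent> full = null;
        lock (_lock)
        {
            _buffer.Add(e);
            if (_buffer.Count >= _maxBatchSize)
            {
                full = new List<TEvent>(_buffer);
                _buffer.Clear();
            }
        }
        if (full != null) _flush(full); // e.g. one Elasticsearch _bulk call
    }

    public void Flush()
    {
        List<TEvent> pending = null;
        lock (_lock)
        {
            if (_buffer.Count > 0)
            {
                pending = new List<TEvent>(_buffer);
                _buffer.Clear();
            }
        }
        if (pending != null) _flush(pending);
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```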

In general, understand the difference in projections between a replay and live processing. Many projections can limit their output during a replay (updates in memory, then smaller updates to persistence) or switch to batched inserts while in replay mode.
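To illustrate that replay/live split in C# (all types and names invented for the example): the same handler runs in both modes, only the persistence strategy switches.

```csharp
// During replay, only touch in-memory state / a buffer and defer the
// database work; once live, write through per event. A real version
// would also flush the buffer when it grows too large (see the batcher
// sketch above).
using System.Collections.Generic;

public class ProjectListProjection
{
    private readonly List<object> _replayBuffer = new List<object>();
    private bool _live;

    // Call this when the subscription reports it has caught up.
    public void GoLive()
    {
        _live = true;
        FlushBatch(); // persist whatever the replay accumulated
    }

    public void Handle(object evt)
    {
        if (_live) WriteThrough(evt);   // one small upsert per event
        else _replayBuffer.Add(evt);    // defer and batch during replay
    }

    private void FlushBatch()
    {
        if (_replayBuffer.Count == 0) return;
        // One bulk insert / batched statement instead of N round trips.
        _replayBuffer.Clear();
    }

    private void WriteThrough(object evt)
    {
        // Single-row update against the view store.
    }
}
```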

Cheers,

Greg

I’ve had the same issue; the key points were:

Buffer and write bigger transactions (but not too big!).

Limit memory usage: make sure you don’t queue up too many events waiting to be written, or build overly large in-memory read models. MS SQL performs very poorly under high memory pressure.

Cache your IDs so you know whether you’re about to do an update or an insert (unless you can deduce that from the event, and you always do a full replay).
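A minimal sketch of that last point - just a HashSet wrapper, nothing library-specific:

```csharp
// Remember which ids the view has already seen so the writer can pick
// INSERT vs UPDATE without a SELECT round trip per event.
using System.Collections.Generic;

public class IdCache
{
    private readonly HashSet<string> _knownIds = new HashSet<string>();

    // True the first time an id is seen (-> INSERT),
    // false on later events for the same id (-> UPDATE).
    public bool IsNew(string id)
    {
        return _knownIds.Add(id); // Add returns false if already present
    }
}
```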

/Peter

In-memory read models are awesome for several reasons:

  • No “migration” scripts required - easy to develop

  • Blistering performance

But have a realistic idea of how much time it will take to spin them up. It will be very fast at the beginning of your development. But if some read models need to read many or all streams, it will eventually take quite a bit of time. (A classic example would be having one stream per user project, but then needing a read model that showed the list of all user projects by name and last modified date. You need to read from all the project streams.)

Cleverer stream querying (can you look only for streams that start with “project/”?) helps, but ultimately if you have millions of events to process, you’ll want to persist your read models so every app startup doesn’t take 20 minutes. This is where GES projections are handy as they automatically track progress against streams. But they have their limits – keeping track of all the projects in your model is probably not a great use.
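To sketch the persistence side in C# (the table and column names are invented for the example): store the last processed position alongside the read model and update it in the same transaction as the view rows, so startup can resume from there.

```csharp
// Illustrative checkpoint store using plain ADO.NET. Assumes a one-row
// 'checkpoints' table; the important part is committing the position
// together with the view-row writes.
using System;
using System.Data;

public static class CheckpointStore
{
    public static long? Load(IDbConnection conn)
    {
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText =
                "SELECT position FROM checkpoints WHERE name = 'project_list'";
            var result = cmd.ExecuteScalar();
            return (result == null || result is DBNull)
                ? (long?)null
                : Convert.ToInt64(result);
        }
    }

    public static void Save(IDbConnection conn, IDbTransaction tx, long position)
    {
        using (var cmd = conn.CreateCommand())
        {
            cmd.Transaction = tx; // same transaction as the view updates
            cmd.CommandText =
                "UPDATE checkpoints SET position = @p WHERE name = 'project_list'";
            var p = cmd.CreateParameter();
            p.ParameterName = "@p";
            p.Value = position;
            cmd.Parameters.Add(p);
            cmd.ExecuteNonQuery();
        }
    }
}
```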

There will soon be Lucene built in as well, for light querying purposes (e.g. show me projects that start with the letter q).

I can’t wait to see how this shakes out. I’m guessing 80% of the drudge-work UI-related read models people need will just magically work with this feature – like “list all my projects”.

One of the great things about ES in general is the full history it provides. I confess I haven’t much thought about how well Lucene would provide support for this, but it would be really quite wonderful if we could query historical versions of such read models. Getting temporal support right in databases is really hard, but with ES you have a built-in advantage. Have you looked at this much? I’d imagine in what seems to be your bread and butter industry, this would come up a lot. (“What were the outstanding orders at 11:23:52.934am”)

Is there a built-in way to differentiate between replay and live processing in a subscription? I mean, can I tell by looking at EventStoreCatchUpSubscription whether I’m in replay mode or live processing? I guess the answer is no, since you are always sort of catching up in a subscription.

There is an event on the subscription for this:

https://github.com/EventStore/EventStore/blob/dev/src/EventStore.ClientAPI/EventStoreCatchUpSubscription.cs#L53

It lets you know when you have switched to processing live events. There may be some number of things queued at this point, but you are on live processing.
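Roughly how the wiring looks with the .NET ClientAPI - treat this as a sketch, since the exact SubscribeToAllFrom signature has changed between client versions; `conn` is assumed to be an open IEventStoreConnection, and ProjectListProjection is the replay/live sketch from earlier in the thread:

```csharp
var projection = new ProjectListProjection();

var subscription = conn.SubscribeToAllFrom(
    null,                                   // lastCheckpoint: null = from the start
    false,                                  // resolveLinkTos
    (sub, evt) => projection.Handle(evt),   // eventAppeared: buffer or write through
    sub => projection.GoLive());            // liveProcessingStarted: flush, go per-event
```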

Cool!

So then I can just use that for batching things up while replaying. Nice!