How to keep the GES DB small / manageable ... ?

Hello there,

We now have a small GES instance running on an Azure VM, and the little thing has grown to 40 GB in a month.

We see a few optimisation leads to reduce the size of the database:

  • put the GES DB in a compressed folder (saving around half the space on Windows)

  • reduce the $maxAge metadata of our temporary streams (obviously)

  • zip big events (say, bigger than 3 KB) and copy some properties into headers so projections can still use them

(but will it save anything, since the DB is already compressed by the file system?)
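The third bullet can be sketched roughly as follows (a Python illustration of the idea, even though our stack is C#; the event shape and field names are made up for the example):

```python
import gzip
import json

# Hypothetical event payload: a verbose product description well over 3 KB.
event = {
    "eventType": "ProductDescriptionChanged",
    "productId": "sku-42",  # this kind of field would be copied into headers
    "description": "A very long, repetitive product description. " * 150,
}

raw = json.dumps(event).encode("utf-8")
packed = gzip.compress(raw)

# Repetitive text compresses very well; the stored event body would be
# `packed`, with the properties projections need kept in plain metadata.
print(len(raw), len(packed))
```

Whether this still gains anything on top of file-system compression is exactly the open question in the parenthesis above.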

With all these solutions we should reduce this 40 GB to … 5. But here is the problem: that is 1/1000 of our total volume, so how will we handle 5 TB or more of data?

And since our business should do very well with the new version we are planning to deliver, how will we deal with 10 times that, or 50 times?

We then plan to implement a ‘flatten’ functional event to shrink an aggregate to its final state (and $tb the past events)
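The 'flatten' idea is essentially a snapshot: fold the stream into its final state, write that as one event, then point $tb (truncateBefore) at it so a scavenge can reclaim the history. A minimal sketch, with made-up event types and a naive fold:

```python
from functools import reduce

# Hypothetical events for one aggregate, oldest first.
events = [
    {"type": "ProductCreated", "data": {"name": "Widget", "price": 10}},
    {"type": "PriceChanged",   "data": {"price": 12}},
    {"type": "PriceChanged",   "data": {"price": 9}},
]

def apply(state, event):
    # Naive projection: each event overwrites the fields it carries.
    return {**state, **event["data"]}

final_state = reduce(apply, events, {})

# The single 'flatten' event that stands in for the whole history.
# After appending it, set $tb in the stream metadata to its event number,
# so the earlier events become eligible for scavenging.
flatten_event = {"type": "AggregateFlattened", "data": final_state}
print(flatten_event)
```

The caveats discussed below (projections that have not caught up, projections that cannot be rebuilt from the flattened state) all stem from the history being gone after the scavenge.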

Does all that sound like good practice to you? Or would you have some brighter ideas?

Best regards,


"We then plan to implement a 'flatten' functional event to shrink an
aggregate to its final state (and $tb the past events)"

This can help for aggregates but what about for projections?

3KB is a pretty big event as well!

Are you running scavenge to actually free up the space?
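For reference, a scavenge can be kicked off over the HTTP admin endpoint; a sketch assuming the default port and default admin credentials (adjust both for a real deployment):

```shell
# Trigger a scavenge on the node; scavenging runs per node, so repeat
# on each node in a cluster. Default credentials shown for illustration.
curl -i -d {} -X POST http://127.0.0.1:2113/admin/scavenge -u admin:changeit
```

Note that a scavenge only reclaims space from events that are already eligible (deleted, past $maxAge, or before $tb).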

If you run in a compressed directory on Azure, you will definitely need to benchmark performance. Azure storage is known to be problematic at the best of times, and I don’t know where compression fits into that.

Failing that you’ll likely want to look into some kind of sharding scheme.


Well, if we flatten a stream to its final state, this means:

  • Projections will have to handle this event, even if in some cases it just rewrites the same values.

  • Some projections won't be able to fully restore in case of lost data (change counts …, statistics) => so we'll have to back them up.

  • We have to find a way to avoid 'flattening' an aggregate while some projections are not yet 'consistent' with it.

And yes, some of our events are mainly "product description changed", and some products seem to deserve an insanely long description, so more than 3 KB it is.

Does a 50 TB database sound sane to you?

Can we spread the DB files across multiple disks?
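One knob that does exist for splitting I/O: the index can live on a different volume than the chunk files. A sketch using the EventStore 3.x command-line options (paths are examples; check your version's --help for the exact option names):

```shell
# Chunk files on one disk, index on another.
EventStore.ClusterNode.exe --db D:\es-data --index E:\es-index
```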

Yes, we scavenge.

We'll benchmark the data compression and not worry too much about that.

I agree, Azure disks are network-attached only, so their performance can be challenged by a USB key, for the price of a gold nugget.

We're almost sure we'll end up moving to another solution in the future, but as we're still a small business, we prefer to invest brainpower in coding the product rather than managing the platform.

(And then we invest brainpower in lowering this insane Azure cost.)

At the end of the year, the problem will still be how we deal with this highly space-optimised 50 TB database.

And we are not that keen on having to shard our system, because we'd then also have to shard the reads.

We already have one actor per stream on the projection side, but those actors are created by observing a single internally (GES-)projected stream.

we prefer to invest brainpower in coding the product rather than managing the platform

In that case, take a hard look at AWS - you’ll spend a lot less time and effort for much better reward.

Is sharding on the roadmap for ES?

Some updates here.

  • Compressed file system on Windows Azure: running nicely, CPU overhead negligible, DB size on disk 45% lower.

  • Zipped some big properties (not used by the ES projections) on some events: impact not measured yet.

  • So we'll end up writing a zipped event into a zipped file system. The first zip layer (C# gzip) seems more efficient, but perhaps this zip inception was not useful.
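The "zip inception" suspicion is easy to demonstrate: once data has been gzipped its bytes are close to random, so a second compression pass gains nothing (and usually adds a little framing overhead). A quick Python illustration:

```python
import gzip

# Repetitive text, standing in for a verbose event body.
text = ("A long, repetitive product description. " * 200).encode("utf-8")

once = gzip.compress(text)
twice = gzip.compress(once)  # "zip inception": compress the compressed bytes

# The first pass shrinks the data a lot; the second pass only adds
# gzip header/framing overhead on top of near-random input.
print(len(text), len(once), len(twice))
```

So the file-system compression should mostly be paying its CPU cost for the events that were left unzipped.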

On @PouleDodue's question about a sharding solution: it doesn't seem to be on the roadmap :(

So if we need it in a few months (or years), can we discuss some way to get sharding?

Would it be possible to get pricing for such a feature?

Sure, Azure is not the cheapest. You may look into renting dedicated servers for far better prices.

We had scoped out sharding previously. It really depends on what parts of the laundry list are desired (mostly in terms of management etc.). Just getting sharding working is pretty trivial.