Is there a limit on (projected) stream size? How are categories determined?

Hi,
I’m currently evaluating Event Store for use in our DDD + CQRS/ES architecture. So far I’m very excited to look deeper into it, but I do have a few simple questions:
Is there an upper limit on how many events a stream can contain? I’m aware that there is a 2bn limit on event numbers per stream - but what about projections with linked events?
I’m asking because I’m thinking about using Event Store for messaging between bounded contexts - i.e. I need to be able to get all events for certain aggregates. Example: I thought about writing the per-aggregate events to streams such as {boundedcontext}_{aggregate}-{id}. That should mean I could subscribe to all events of the {boundedcontext}_{aggregate} category stream in order to initialise and update a new bounded context (which would then handle and transform each event into a context-specific event and update the read store in the new BC accordingly).
The alternative would be to add some reliable messaging between the contexts - but Event Store is already a pub/sub system…
How are categories determined? Is this just a split on the last “-”? Can I safely use underscores as above, potentially creating categories such as {tenant}_{boundedcontext}_{aggregate} for streams {tenant}_{boundedcontext}_{aggregate}-{id}?
Or would it just be best to write my own projections for this?

Regards,
Thomas

You can define category separators, but the default is “-”.

There is a limit of 2bn events per stream. LinkTos count as events. We have
discussed removing this limit in the past. Are you likely to hit it?

Greg

The built-in system projection that splits streams up into categories can be edited to use your own separator if you do not like “-”.
E.g. the $stream_by_category projection has a body of “first\r\n-”, which indicates that it should split the stream id at the first “-” it finds. The possible values are “first” and “last”, and the separator can be anything sensible, such as the underscore you mentioned.
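That split is easy to emulate outside the projection; here is a rough JavaScript sketch of how a “first” vs “last” category split behaves (not the actual projection source, just an illustration):

```javascript
// Emulates how the category projection derives a category from a stream id.
// position: "first" splits at the first separator, "last" at the last one.
function categoryOf(streamId, position = "first", separator = "-") {
  const idx = position === "first"
    ? streamId.indexOf(separator)
    : streamId.lastIndexOf(separator);
  // Streams without the separator belong to no category.
  return idx === -1 ? null : streamId.slice(0, idx);
}

// With "first" and "-", underscore-composed ids stay in one category:
categoryOf("tenant_bc_aggregate-42");  // "tenant_bc_aggregate"
// With "last", ids containing several separators split differently:
categoryOf("bc-aggregate-42", "last"); // "bc-aggregate"
```

So as long as the aggregate id itself never contains the separator, a “first” split on “-” keeps underscore-built prefixes intact as a single category.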

I’m afraid that we might hit it for very busy aggregates. We’re migrating from an older CRUD-based system, and some tables have >20 million rows of data - and since we’ll do event sourcing, that number is only the initial set, not the granular events on these aggregates that will follow. This might be unwarranted, but I don’t have any other metrics to base the estimation on.

Maybe I’m approaching this from the wrong angle. After all, I would need to solve a similar issue for my own context in order to build and update the read models or index of read models. Both because I need to update a “list” read model (otherwise we wouldn’t know what aggregates we have - i.e. the user wouldn’t be able to select anything from a list) and because we wouldn’t subscribe to a per-aggregate stream (20+ million subscriptions seem a bit unlikely).

What alternatives could there be? Messaging alone wouldn’t give us an index of what we’d have to read in case we have to rebuild our read store - and N+1 select doesn’t seem like such a good idea for 20+ million aggregates.

Try reading from all and dropping the ones you aren’t interested in.
Provided you are under a few thousand events/second, this is
network-stupid/complexity-smart: you send over events you don’t need
and drop them, but keep a simple subscription model.
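That drop-what-you-don’t-need approach boils down to a predicate over stream ids; a minimal sketch (the subscription wiring itself is omitted, and the event shape here is a hypothetical stand-in):

```javascript
// Client-side filtering over an "all events" feed: keep only events
// whose stream belongs to one of the categories we care about.
function makeFilter(categories) {
  const wanted = new Set(categories);
  return (event) => {
    const idx = event.streamId.indexOf("-");
    return idx !== -1 && wanted.has(event.streamId.slice(0, idx));
  };
}

const keep = makeFilter(["sales_order", "sales_customer"]);
const feed = [
  { streamId: "sales_order-1",     type: "OrderPlaced" },
  { streamId: "billing_invoice-7", type: "InvoiceRaised" },
  { streamId: "sales_customer-3",  type: "CustomerRenamed" },
];
const interesting = feed.filter(keep); // drops the billing event
```

The subscription stays dumb; all the routing intelligence lives in one small, testable function on the consumer side.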

I don’t think that listening to all events and then dropping only the ones I’m not interested in would be reasonable - especially in catch up scenarios.
I’d either have to build my own pub/sub system after that or have each context or read model generator etc. listen separately…

I’ll have to do some calculations to estimate the number of events we’ll generate and look into more detail how much of that migration data represents historical data (i.e. events) and how many actually correspond to aggregate instances…

"I don't think that listening to all events and then dropping only the
ones I'm not interested in would be reasonable"

Have you benchmarked? We have benchmarks of catch-up subscriptions at
around 70k events/second; persistent subscriptions are faster.

I’ll have to look into that. I wouldn’t need every subscriber to listen to all events - “only” the ones that are not partitioned by aggregate instance, i.e. all list read model generators and cross-context message propagation.
That would leave a few tens of simultaneous all-events subscriptions - at least one per list read model and one per aggregate type per consuming context. I’m a bit worried about the initial burst of traffic and processing in (re)initialisation scenarios - which would happen whenever we deploy a new installation or add a new context.

P.S.: What are the reasons for selecting this limit and sticking with it?
If we ever have to deal with any kind of real time information to process, this could get very difficult to partition and evaluate - forcing us to agonise over how we plan our partitions and how we can keep track of them…

"P.S.: What are the reasons for selecting this limit and sticking with it?
If we ever have to deal with any kind of real time information to
process, this could get very difficult to partition and evaluate -
forcing us to agonise over how we plan our partitions and how we can
keep track of them..."

It’s very easy to partition by time period (e.g. every day).
Remember that ES also has essentially zero cost for creating new
streams and can support many millions of streams.
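Since streams are cheap, a time-partitioned stream id can be derived directly from the event timestamp; a small sketch (the naming convention is just an example, not anything Event Store prescribes):

```javascript
// Derive a per-day stream id, e.g. "metrics-2024-03-15", so a busy
// category never accumulates events in a single unbounded stream.
function dailyStreamId(category, date) {
  const day = date.toISOString().slice(0, 10); // "YYYY-MM-DD" in UTC
  return `${category}-${day}`;
}

dailyStreamId("metrics", new Date(Date.UTC(2024, 2, 15))); // "metrics-2024-03-15"
```

Writers and readers can both compute the partition key from the clock, so no lookup table is needed for time-based partitions.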

…but you’d still have to know the partition key so that you can find those events again.

Partition-1 (last event PartitionMovedTo { id : "partition-2" })
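In other words, a reader can discover the next partition by following those pointer events; a toy sketch of the idea, with in-memory arrays standing in for real stream reads (all names hypothetical):

```javascript
// Walks a chain of partitions: reads each one in order, and when it
// encounters a PartitionMovedTo pointer event, continues reading in
// the partition that event names.
function readAllPartitions(streams, startId) {
  const events = [];
  let id = startId;
  while (id) {
    const stream = streams[id] || [];
    let next = null;
    for (const e of stream) {
      if (e.type === "PartitionMovedTo") {
        next = e.data.id; // follow the pointer to the next partition
      } else {
        events.push(e);
      }
    }
    id = next;
  }
  return events;
}

const streams = {
  "partition-1": [
    { type: "Measured", data: { v: 1 } },
    { type: "PartitionMovedTo", data: { id: "partition-2" } },
  ],
  "partition-2": [{ type: "Measured", data: { v: 2 } }],
};
readAllPartitions(streams, "partition-1"); // both Measured events, in order
```

So the only partition key a consumer has to know up front is the first one; the rest of the chain is encoded in the streams themselves.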

Good point.
How could we write projections on such “unknown” partitions? fromAll and filter on event and stream pattern? Or could we use fromCategory (i.e. is fromCategory just using a predefined stream which is subject to the same restrictions, or not)?