Stream max length and Stream reduction

Hello Guys!

Currently, in my company, we are considering adopting EventStoreDB for some features that we are extracting from our core service, and it would be great if you could clarify some doubts.

In the feature that we want to migrate, we have three entities defined as follows:

Region:
  uuid: UUID
  name: string
  zones: Zone[]

Zone:
  uuid: UUID
  name: string
  region: Region

Activity:
  uuid: UUID
  name: string
  zones: Zone[]

We have the possibility to add an Activity to a Zone, and a Zone can be assigned to only one Region.
Then, we will have the following events:

AddActivityIntoAZone:
  zoneUuid: UUID
  activityUuid: UUID 

RemoveActivityFromAZone:
  zoneUuid: UUID
  activityUuid: UUID 

We will have roughly 100 streams, where each stream initially has between 20K and 300K events,
and each stream will grow by 3–10K events per month.

I cannot split it into more atomic streams, and it will be structured in the following way:

V1-RegionA-ZoneX-stream
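
To make the intended write path concrete, here is a minimal sketch of appending one of these events to such a stream, assuming the Node/TypeScript client @eventstore/db-client, a local insecure node, and hypothetical identifiers:

  import { EventStoreDBClient, jsonEvent } from "@eventstore/db-client";
  import { randomUUID } from "node:crypto";

  // Assumed connection string for a local, insecure single node.
  const client = EventStoreDBClient.connectionString`esdb://localhost:2113?tls=false`;

  // Hypothetical identifiers, only for illustration.
  const zoneUuid = randomUUID();
  const activityUuid = randomUUID();

  // Append an AddActivityIntoAZone event to the per-zone stream.
  const event = jsonEvent({
    type: "AddActivityIntoAZone",
    data: { zoneUuid, activityUuid },
  });

  await client.appendToStream("V1-RegionA-ZoneX-stream", event);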

We are using Replicator and Kafka. The various topics will be consumed by different services, and Kafka will have an infinite retention policy so that we do not flood the different services when some service needs to replay all the events. We thought about this approach just to have the guarantee that all the changes are correctly propagated and that it would be easy to replay all the events in case of necessity.

I’m wondering whether EventStoreDB can manage this kind of load over a long time span, or whether a stream reduction process (a sort of “snapshotting”) is needed.

By stream reduction I mean regrouping events: creating a new stream that regroups all the existing events using new event types like:

AddMultipleActivityIntoAZone:
  zoneUuid: UUID
  activitiesUuid: UUID[]

RemoveMultipleActivityFromAZone:
  zoneUuid: UUID
  activitiesUuid: UUID[]

The events that aren’t part of the regrouping will be appended immediately afterwards, and the old stream will be archived in an S3 bucket.
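
As a rough illustration of what such a reduction could look like (a sketch only, assuming the Node/TypeScript client, hypothetical stream names, and that the two event types above are the only ones in the stream), the reducer could fold the old stream into the set of activities still assigned and write a single regrouped event:

  import { EventStoreDBClient, jsonEvent } from "@eventstore/db-client";

  const client = EventStoreDBClient.connectionString`esdb://localhost:2113?tls=false`;

  // Hypothetical old and reduced stream names.
  const oldStream = "V1-RegionA-ZoneX-stream";
  const reducedStream = "V2-RegionA-ZoneX-stream";

  // Fold the old stream into the set of activities currently in the zone.
  const stillAssigned = new Set<string>();
  let zoneUuid = "";
  for await (const { event } of client.readStream(oldStream)) {
    if (!event) continue;
    const data = event.data as { zoneUuid: string; activityUuid: string };
    zoneUuid = data.zoneUuid;
    if (event.type === "AddActivityIntoAZone") stillAssigned.add(data.activityUuid);
    if (event.type === "RemoveActivityFromAZone") stillAssigned.delete(data.activityUuid);
  }

  // Write a single regrouped event to the reduced stream.
  await client.appendToStream(
    reducedStream,
    jsonEvent({
      type: "AddMultipleActivityIntoAZone",
      data: { zoneUuid, activitiesUuid: [...stillAssigned] },
    })
  );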

If a stream reduction is needed, I know that the projections on Replicator could be helpful. To respect the event order, we would need to create the reduced stream, stop the functionality for some time, append the events that we’ve not reduced, and restart the functionality.
Furthermore, we would need to delete all the old Kafka topics and change the stream names in Replicator.

Is there a better and/or native approach that simplifies solving this kind of problem?

In case it’s helpful, I’ve created a little architectural sketch.

Thank you in advance for any answer or feedback :smiley:

I’m wondering whether EventStoreDB can manage this kind of load over a long time span, or whether a stream reduction process (a sort of “snapshotting”) is needed.

The database is not affected by how long your streams are, but your application is.

I know that the projections on Replicator could be helpful.

Replicator can do some transformations, but not projections.

I cannot split it into more atomic streams, and it will be structured in the following way

It seems you are making your stream naming design match your Kafka topic names.
For the ESDB -> Kafka sink I would use the routing feature (https://replicator.eventstore.org/docs/features/sinks/kafka/#routing)
and have a simpler stream design:

  • I’m making the following assumption: Region & Zone are reference data that might have their own streams, and given the more or less 100 streams, I would probably have 1 stream per region containing all of its zones

The stream design for the activity would be
=> Activity-[ActivityUUID] with 2 events: Added (RegionId, ZoneId, uuid), Removed (RegionId, ZoneId, uuid)
This would make a lot of small streams, removing the need for snapshotting, and they would be more easily archivable.
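
To keep one topic per region-zone even with per-activity streams, the routing rule could derive the topic from the event payload rather than from the stream name. A sketch only, assuming (based on the linked docs) that the Kafka sink accepts a JavaScript route function returning a topic and partition key; the exact signature and the field names (regionId, zoneId, uuid) are assumptions to check against your setup:

  // Hypothetical Replicator Kafka sink routing rule (JavaScript).
  // Routes per-activity streams into one topic per region-zone pair.
  function route(stream, eventType, data, meta) {
    return {
      topic: "region-" + data.regionId + "-zone-" + data.zoneId,
      partitionKey: data.uuid,
    };
  }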

Kafka will have an infinite retention policy so that we do not flood the different services when some service needs to replay all the events. We thought about this approach just to have the guarantee that all the changes are correctly propagated and that it would be easy to replay all the events in case of necessity.

Can you be more explicit about “in case of necessity” and “need to replay all the events”?
Typically, when adding and bootstrapping a new service, or in a recovery scenario, I would use a different channel (i.e. push events to the service directly from ESDB) and, once it is up to date, switch to the regular subscription on the Kafka topic.
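
A rough sketch of what that out-of-band bootstrap channel could look like with the Node/TypeScript client, assuming the Activity-[ActivityUUID] stream design above and a hypothetical handle() projection on the consumer side:

  import { EventStoreDBClient, streamNameFilter, START } from "@eventstore/db-client";

  const client = EventStoreDBClient.connectionString`esdb://localhost:2113?tls=false`;

  // Hypothetical consumer-side handler; the regular flow would feed the same
  // handler from the Kafka topic instead.
  async function handle(type: string, data: unknown): Promise<void> {
    // project into the service's own store, idempotently
  }

  // One-off bootstrap: catch up on all Activity-* streams directly from ESDB.
  const subscription = client.subscribeToAll({
    fromPosition: START,
    filter: streamNameFilter({ prefixes: ["Activity-"] }),
  });

  for await (const { event } of subscription) {
    if (event) await handle(event.type, event.data);
    // Once caught up, unsubscribe and switch to the regular Kafka subscription.
  }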

Replicator can do some transformations, but not projections.

Sorry, I meant transformation. It was a typo :confused:

Activity-[ActivityUUID] with 2 events: Added (RegionId, ZoneId, uuid), Removed (RegionId, ZoneId, uuid)

I’ve already thought about this structure, but it makes it more “complex” to handle a recovery scenario where, for example, I need to re-propagate all the activities related to a Zone and/or a Region.
I structured it that way just to have a simple architecture that allows replaying all the events without implementing further logic on the consumer side, which will handle all the events idempotently. So, in case of a problem, I just need to delete the specific topic and reload all the streams related to that topic. With the approach you’ve suggested, I would need to send all the messages to Kafka and everyone would consume them.
But, in this way, the data on EventStoreDB would be handled better and we wouldn’t need the snapshotting logic that I explained before.

We have some constraints related to team seniority, and we want to simplify all the cases as much as possible so that consuming and recovering the data is easy, moving the complexity to the service that handles and stores the data on EventStoreDB.

We need to consider the various trade-offs here and decide which one is better for our requirements. Maybe doing a PoC could be helpful.

Thank you for your answers and help! :smiley:

A PoC will certainly help you decide.
You’ll still need an out-of-band process for recovery, so that might be the thing to test out with the two stream designs (I’m not seeing any issue with a new region-zone in the system).
In both stream designs:

  • yours: you’ll need to read from a specific stream and move it to Kafka
  • mine: you’ll need to read from $all and filter out before appending to Kafka, as sketched below
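
For the second bullet, a minimal sketch of that recovery path, assuming the Node/TypeScript client and a hypothetical publishToKafka() producer wrapper (e.g. built on kafkajs):

  import { EventStoreDBClient, START } from "@eventstore/db-client";

  const client = EventStoreDBClient.connectionString`esdb://localhost:2113?tls=false`;

  // Hypothetical producer wrapper; in practice a Kafka producer writing to the
  // topic that is being rebuilt.
  async function publishToKafka(topic: string, type: string, data: unknown): Promise<void> {
    // produce the message
  }

  // Recovery with the Activity-[ActivityUUID] design: scan $all, keep only the
  // relevant streams, and forward their events to Kafka.
  for await (const { event } of client.readAll({ direction: "forwards", fromPosition: START })) {
    if (!event || event.type.startsWith("$")) continue; // skip system events
    if (!event.streamId.startsWith("Activity-")) continue;
    await publishToKafka("activities", event.type, event.data);
  }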