Virtual Actors for active streams

Hi all,

I was wondering about the following and would love some input.

We have multiple different stream types; some streams get accessed a lot more than others, and in some cases we need a lot of performance on them.

So the scenario is as follows: we have a stream called finance-guid that contains some of a client’s financial information, one stream per client. When modelling some future data on this stream, a lot of reads and appends happen. We follow the pattern of snapshotting these streams every 500 events, which gives us almost all the perf we need.

Generally what happens in our world is that external systems trigger the modelling part, and then the following happens: in our domain we have a couple of methods that add or modify specific areas of the object; once these are performed we call a calculate method, and the result is returned.

So we load the stream from ES (partially, if we have a snapshot), then we perform some operations, perform the calculation, and then persist again.
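To make the load-operate-persist loop concrete, here is a minimal in-memory sketch. The names (`FinanceAggregate`, `SNAPSHOT_INTERVAL`, the event shapes) are illustrative assumptions, not the real domain model or the ESDB client API:

```python
# Sketch of rehydrating an aggregate from a snapshot plus the events
# appended after it, with a snapshot taken every 500 events as described
# above. All names here are hypothetical.

SNAPSHOT_INTERVAL = 500  # snapshot every 500 events

class FinanceAggregate:
    def __init__(self):
        self.version = 0    # number of events applied so far
        self.balance = 0.0  # stand-in for the client's financial state

    def apply(self, event):
        # Replay a single event onto the in-memory state.
        if event["type"] == "AmountAdded":
            self.balance += event["amount"]
        self.version += 1

def load(snapshot, events_after_snapshot):
    """Rehydrate from the latest snapshot (if any) plus newer events."""
    agg = FinanceAggregate()
    if snapshot is not None:
        agg.balance = snapshot["balance"]
        agg.version = snapshot["version"]
    for e in events_after_snapshot:
        agg.apply(e)
    return agg

def should_snapshot(version):
    # Time to take a new snapshot every SNAPSHOT_INTERVAL events.
    return version > 0 and version % SNAPSHOT_INTERVAL == 0

# Usage: rehydrate from a snapshot taken at version 500 plus two newer events.
snap = {"version": 500, "balance": 100.0}
tail = [{"type": "AmountAdded", "amount": 5.0},
        {"type": "AmountAdded", "amount": 7.5}]
agg = load(snap, tail)
print(agg.version, agg.balance)  # 502 112.5
```

The point of the sketch is just that only the events after the last snapshot need replaying, which is where the perf win comes from.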

The area of issue is that in most cases this happens quite a lot in a short space of time, and getting the aggregate and saving it is degrading performance.

We can scale most parts of our APIs, but we also want to keep costs in mind.

Anyhow, I am pondering the idea of using virtual actors. For instance, if a client is about to start modelling, we would get the aggregate from the snapshot plus the rest of the events from ES and load that into an actor. The operations would happen on it, and once the actor is being deactivated we would get the new events from it and persist them to ES.

I believe this approach would work, as the framework we are using for actors guarantees reliability and durability, and also that only one instance of said actor is available.

What do you guys think?

We would definitely have the perf that we require, as it would all be in memory; we would not need to rehydrate our aggregates the whole time; and lastly, we would still be persisting events to ES, but only after deactivation instead of after each modification.

Not entirely sure what you mean by “only saving when stopping the actor”?

“…and once the actor is deactivated we would get the new events to apply from it before deactivating and then persist to ES”

What happens generally, once the actor has been spawned and is ready to accept messages (i.e. the entity has been set up by reading the stream: snapshot plus the rest of the events):

  • An operation is sent to that actor
  • Events are generated & persisted, and the optimistic version is updated
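The flow above, appending right away with an optimistic version check, can be sketched like this. The store and its `append` signature are an in-memory illustration, not ESDB’s actual client API:

```python
# Sketch of per-operation persistence with optimistic concurrency: each
# append states which stream version it expects, so a concurrent writer
# is detected instead of silently overwriting. Names are hypothetical.

class WrongExpectedVersion(Exception):
    pass

class InMemoryStreamStore:
    def __init__(self):
        self.streams = {}

    def append(self, stream, events, expected_version):
        current = len(self.streams.get(stream, []))
        if current != expected_version:
            # Someone else appended in the meantime: reject the write.
            raise WrongExpectedVersion(f"expected {expected_version}, was {current}")
        self.streams.setdefault(stream, []).extend(events)
        return current + len(events)  # new stream version

store = InMemoryStreamStore()

# First operation: stream is empty, so we expect version 0.
v = store.append("finance-123", [{"type": "AmountAdded", "amount": 5}],
                 expected_version=0)
print(v)  # 1

# A stale writer still expecting version 0 gets rejected.
try:
    store.append("finance-123", [{"type": "AmountAdded", "amount": 1}],
                 expected_version=0)
except WrongExpectedVersion as e:
    print("conflict:", e)
```

With a single actor per stream the expected version is just the actor’s own counter, so conflicts should not occur in normal operation; the check is the safety net.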

Waiting for the actor to be stopped before persisting the events might cause problems:

  • catastrophic failure (what if the underlying machine just dies…)
  • downstream consumers won’t get events until they are persisted.

Hi @yves.lorphelin, yeah, I gave it a bit more thought and I would say that was a bit of a bad design.

I think that I am going to change it, so once the operation has completed, we would “drain” all the events that have occurred.

I think the actor will just be the in-memory representation of the aggregate, and it will live until it is no longer used. Then, on deactivation, we’d just do a sanity check that there are no events left, but this should never happen.
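The revised design, draining events after every operation rather than at shutdown, might look roughly like this. `AggregateActor`, `drain`, and the event shapes are illustrative assumptions:

```python
# Sketch of the "drain after each operation" design: the actor keeps the
# aggregate state in memory plus a list of uncommitted events; after
# every operation the caller drains and persists them, so deactivation
# should never find anything pending.

class AggregateActor:
    def __init__(self):
        self.state = 0        # stand-in for the aggregate state
        self.uncommitted = [] # events not yet persisted to ES

    def handle(self, amount):
        # Run the operation against in-memory state and record the event.
        self.state += amount
        self.uncommitted.append({"type": "AmountAdded", "amount": amount})

    def drain(self):
        # Hand the uncommitted events to the caller for persistence.
        events, self.uncommitted = self.uncommitted, []
        return events

    def deactivate(self):
        # The sanity check from the post: nothing should be left here.
        assert not self.uncommitted, "events lost at deactivation!"

actor = AggregateActor()
actor.handle(5)
persisted = actor.drain()   # persist these to ES right away
actor.deactivate()          # passes: nothing pending
print(len(persisted), actor.state)  # 1 5
```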

:+1: Exactly, yes, that’s what I’ve seen happening in the wild.


When I was building actor model-based event-sourced systems, my actors were persisting all the events. There’s a risk that if you don’t do that, you’ll lose business events if the actor crashes for any reason. Expecting an actor to always shut down gracefully is dangerous.


Just a question around that: are you saying that the actor, and not the service calling it, has the responsibility to store the events?

As I’ve read it, actors are single-threaded and should not really have any direct dependencies in them; rather, state is given, the process is run, and the state is stored by the caller, which is multithreaded.

I am just worried that I might lock up the actor if it has the responsibility of storing events as well. Or am I misunderstanding you?

Keep it simple: the actor is a host for the state, and indeed after finishing work it appends the data.
Anything that, inside the actor runtime, needs to react to that is fed events by a subscription.


Don’t know where that comes from. Actors don’t run in threads: an actor processes one message from its mailbox when there’s a message, then goes to sleep. Having a thread per actor would make it impossible to build the systems that are usually built with actor models (potentially millions of actors).

There’s no issue with an actor having dependencies. You can separate the writer into another actor, but that one will then have a dependency; someone must have the persistence dependency. Moving persistence outside of the actor boundary makes it unreliable, and slower, without any benefit at all.

You can check the Proto.Actor docs with regard to persistence. I also did a talk about it. I had some technical issues during the talk, so it is chaotic, but it might help.

I think I might have confused things with what I meant by single-threaded.

The single-threaded part comes from Service Fabric actors: they are single-threaded in the sense that only one process can access an actor at a time. If there is more than one, they form a queue before the actor.
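That queue-before-the-actor behaviour (often called turn-based concurrency) can be sketched with a plain mailbox: many caller threads post messages, and a single worker drains them one at a time, so the actor’s state never needs a lock. This is a simplified illustration, not how Service Fabric actually implements it:

```python
# Sketch of turn-based actor access: callers post messages to a mailbox
# and one worker processes them sequentially, so only one operation
# touches the actor's state at a time even with many caller threads.

import queue
import threading

class MailboxActor:
    def __init__(self):
        self.count = 0
        self.mailbox = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        # One message at a time: no locking needed on self.count.
        while True:
            msg = self.mailbox.get()
            self.count += msg
            self.mailbox.task_done()

    def tell(self, msg):
        # Callers just enqueue; they "form a queue before the actor".
        self.mailbox.put(msg)

actor = MailboxActor()
threads = [threading.Thread(target=actor.tell, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
actor.mailbox.join()  # wait until every queued message has been processed
print(actor.count)  # 100
```

Because increments happen only on the worker thread, the final count is exact with no race, which is the whole appeal of the model.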

Makes sense to keep everything within the actor itself, but now comes the question: how many subscriptions can a three-node cluster handle? We can additionally add another two read-only nodes as well.

We usually have about 4,000-5,000 active aggregates at a time at peak load, but this grows as more clients are onboarded.

I know that we can scale, but I just want to know if there is a theoretical or practical limit with a three-node cluster?

Take a look at what Commanded does in Elixir land: https://github.com/commanded/commanded/blob/master/lib/commanded/aggregates/aggregate.ex

Hopefully it is easier to follow the code than me explaining things.

Every aggregate is a GenServer/actor, and you get to configure its lifespan based on whatever criteria you want.
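A configurable lifespan in that spirit can be sketched as follows: the actor records its last activity, and the host deactivates it after a chosen idle period. This is a simplified illustration with the clock passed in explicitly; the names and the timeout policy are assumptions, not Commanded’s actual implementation:

```python
# Sketch of a configurable aggregate-actor lifespan: the host checks an
# idle timeout after each message and deactivates the actor (persisting
# a snapshot, freeing memory) once it has been quiet long enough.

class LifespanActor:
    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # seconds of inactivity allowed
        self.last_activity = 0.0

    def handle(self, now):
        # Every handled message resets the idle clock.
        self.last_activity = now

    def should_deactivate(self, now):
        return (now - self.last_activity) > self.idle_timeout

actor = LifespanActor(idle_timeout=30.0)
actor.handle(now=100.0)
print(actor.should_deactivate(now=120.0))  # False: only 20s idle
print(actor.should_deactivate(now=140.0))  # True: 40s idle
```

Tuning the timeout is the cost/perf knob: a long lifespan keeps hot aggregates in memory, a short one bounds the number of live actors.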

At least in Erlang, even 2 million actors would be nothing, as long as you have proper clustering and/or the resources, such as memory, to handle them.


I guess the question is not about how many actors you have on a three-node cluster, but how a three-node ESDB cluster would behave. We built a system with actors a while ago. It normally had 100K active virtual actors during normal operational hours, peaking at maybe 200K. That wasn’t really an issue, but we persisted too many events, so we moved to BigTable for persistence. Now we are reconsidering moving back to ESDB, as we can cut the number of events by tenfold.

The issues are kind of known. It’s mostly the warmup time when lots of actors need to load their state after one node disappears and the actors get moved.

Still, there’s one huge benefit of using the actor model: you only append events after the actor state is loaded, compared with a “stateless” system where you read and write all the time. That’s what I tried to explain in my talk. Keeping events in memory and dumping them when the actor shuts down is not an option for us, as it’s very unreliable.

I have plans for complementing Eventuous with a Proto.Actor integration; it should be quite easy to do. One thing I am definitely not going to do is use actors as aggregates. Actors will be stateful app services that handle commands, keeping a single aggregate instance in their state.


What was the platform if you don’t mind answering?

It definitely depends on the tools available; such a thing could be a nightmare. A lot of actor frameworks come and go selling speed, among other things…

While the Erlang folks keep observing those failures, because of exactly that: reliability. If you can’t trust your actors to be fault-tolerant, then all you’ve got is fancy queues with identities so you can communicate with them directly.

The one part that most people overlook, and one of the reasons why I am quite happy on the BEAM, is that it was built around the fact that systems will fail, so the architecture of the entire VM and programming language takes that into consideration. It’s hard to make that work in languages that weren’t built for it.

High expectations for Proto.Actor, for sure.

I totally agree with your last comments. I don’t foresee ever getting to that many actors, as the system is not used continuously; it’s more of a log-in, do some work, log-out type of system.

I will go and watch your talk again as I think I missed a couple of things.

Can anybody contribute to Eventuous?

I cannot answer for @alexey.zimarev, but we used Service Fabric Reliable Actors, which are partitioned, with backup states kept across multiple regions and zones.

We have one active primary, two active secondaries, and then another six passive secondaries per actor. Requests are resolved only on the active primary.

In the 5 years that we have used SF, none of our reliable services ever lost state, and reliability was guaranteed the whole time. Service Fabric is also the service that powers Azure, well, about 70% of it; even Azure Kubernetes runs on top of SF in Azure. So if it’s good enough to power Azure, it’s good enough for us.

Of course :slight_smile: It’s open source, but at the moment I am the one who decides what goes in.

Azure is a public cloud, and this very fact sets certain requirements for the tools used internally. Just as you wouldn’t use Kubernetes to host an SPA or a static website, maybe what’s good enough for Azure isn’t necessarily good enough for you, as it’s just too much?

Can you point to a resource that tells the story of AKS running on SF? I’m sorry to be perhaps annoying in that regard, but I don’t think it’s possible.

My experience with SF was not the best. SF is heavy, local development sucks, state management is awkward, etc.

Not sure I got your question right, but it was (and is) a Proto Actor-based system running in Kubernetes. We have quite a few of those. Roger worked for us for a year, so we got some competence in there.

Yeah, sure, here is the article; I have also heard this many a time from our account manager over at Azure.

https://www.isolineltd.com/blog/2020/is-azure-service-fabric-still-relevant-in-2020.html#:~:text=Service%20Fabric%20is%20definitely%20not,there%20are%20reasons%20for%20it.&text=Even%20some%20parts%20of%20Kubernetes,on%20top%20of%20Service%20Fabric.

I agree the dev experience was awful, but now, with the latest SDKs and the open-sourcing of SF, it has turned around dramatically. It is really something that is overlooked most of the time due to Kubernetes and the others.

Yes, you get tied in to most things that SF offers, but if your services and apps are written correctly you can get around this, and if SF dies, transitioning to Kube and the like is quite simple and easy to do.

Yeah, I’ve seen that one :slight_smile: They say “some parts of AKS”, which is understandable.

That’s true, reading it again. Still, for the most part there seems to be some good value in the product and the model. It’s a lonesome world in SF, but moving years of work is just not an option, and it’s currently not feasible, as the knowledge of Kube is not there, and I’ve heard a lot about having to bring your own tools to the model itself. Anyhow, I think I’m drifting again; thanks again for the input, and I will be sure to see if I can contribute to Eventuous.