New historical data incoming, rebuild everything?

antti · November 15, 2024, 12:00pm

Hi all!

I’m building a personal stock trading algorithm as a hobby project. I think Event Sourcing and EventStoreDB would be a perfect system for back-testing with historical data: replaying events from data sources such as price changes, published news, etc., and then testing “what if?”.

This involves saving historical data to ESDB as events. No problem so far, until a couple of weeks later I find a new data source I’d like to include. Because ES is append-only, I can’t add those new historical events where they belong without rebuilding the entire database from scratch.

Is there an approach that would allow me to add historical events in their correct position as I “discover” them, or to achieve similar functionality some other way? Or should I just not be afraid of rebuilding the millions of events again and again? I could imagine having a separate ESDB for each data source and sorting the events programmatically when replaying them. But it would be much nicer not to have to do that, I think.

I hope I was clear with my question!

yves.reynhout · November 15, 2024, 1:57pm

The order in which events are appended and stored in ESDB does not necessarily correspond to the order in which they occurred. The time at which events are appended and stored in ESDB does not necessarily represent the time at which they happened. To that end people often record the time at which the event occurred on the event itself. Our metadata exposes when an event was appended and stored, so you do not necessarily need to record that information (but you could).

From these stored events, with the time they occurred at in their payload, you can create an infinite number of simulation timelines and even persist them as a stream. Assuming events from data source A are stored in chronological order in stream A and events from data source B are stored in chronological order in stream B, you could read an event from the beginning of both stream A and stream B, decide which one came first, and persist it in stream A_B (note this could be a link to the actual event). If the event from stream A was first, you read the next event from stream A and compare it with the same (first) event from stream B, again to decide which one came before the other, and persist it in stream A_B. You do this until no more events are present in either A or B (I’m assuming these data sources are finite). Stream A_B is now your “simulation timeline” in chronological order. Should you learn of a data source C, you can write a new simulation timeline into stream A_B_C. Whether to keep stream A_B around or not is then up to you (scavenging is your friend). At no time is it necessary to rebuild the database from scratch.
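
The merge described above can be sketched roughly as follows. This is only an illustration in plain Python, not EventStoreDB client code: in-memory lists stand in for the streams, and the `Event` class, its `occurred_at` field, and `merge_timelines` are hypothetical names chosen for the example. It assumes each input is already in chronological order, as stated above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape: the time the event occurred lives on the event itself.
    source: str
    occurred_at: datetime
    data: str


def merge_timelines(a: Iterable[Event], b: Iterable[Event]) -> Iterator[Event]:
    """Merge two chronologically ordered streams into one chronological timeline."""
    it_a, it_b = iter(a), iter(b)
    ev_a, ev_b = next(it_a, None), next(it_b, None)
    # Compare the current head of each stream and emit whichever occurred first.
    while ev_a is not None and ev_b is not None:
        if ev_a.occurred_at <= ev_b.occurred_at:
            yield ev_a
            ev_a = next(it_a, None)
        else:
            yield ev_b
            ev_b = next(it_b, None)
    # One stream is exhausted: drain whatever remains of the other.
    while ev_a is not None:
        yield ev_a
        ev_a = next(it_a, None)
    while ev_b is not None:
        yield ev_b
        ev_b = next(it_b, None)


stream_a = [
    Event("A", datetime(2024, 1, 1, 9, 0), "price 100"),
    Event("A", datetime(2024, 1, 1, 9, 5), "price 101"),
]
stream_b = [Event("B", datetime(2024, 1, 1, 9, 2), "news item")]

# The "simulation timeline" for sources A and B.
stream_a_b = list(merge_timelines(stream_a, stream_b))

# A newly discovered source C is merged in without touching A or B.
stream_c = [Event("C", datetime(2024, 1, 1, 8, 59), "filing")]
stream_a_b_c = list(merge_timelines(stream_a_b, stream_c))
```

In a real setup the inputs would be reads from the source streams and each yielded event would be appended (or linked) to the timeline stream. Ties on `occurred_at` are resolved here in favour of the first argument, which is an arbitrary choice.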

HTH,
Yves.