Hey ESDB team,
We’re working on a project to migrate an old payment system to a new in-house system, and we’re using ESDB for it. The pipeline is roughly:
- Persist the event to ESDB (e.g. Deposit)
- Persist a transaction record to MongoDB (read model)
- Update the account state in MongoDB (read model)
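For context, each transaction currently goes through roughly this shape of code. This is a simplified Python sketch (using the esdbclient and pymongo packages); the stream names, event types, and document shapes are illustrative, not our actual schema:

```python
# Simplified sketch of the current per-transaction pipeline.
# Assumes the esdbclient and pymongo packages; names and schemas are illustrative.
import json

from esdbclient import EventStoreDBClient, NewEvent, StreamState
from pymongo import MongoClient

esdb = EventStoreDBClient(uri="esdb://localhost:2113?Tls=false")
db = MongoClient("mongodb://localhost:27017")["payments"]

def handle_deposit(account_id: str, amount: int, tx_id: str) -> None:
    payload = {"accountId": account_id, "amount": amount, "txId": tx_id}

    # 1. Persist the event to ESDB (one stream per account).
    esdb.append_to_stream(
        stream_name=f"account-{account_id}",
        current_version=StreamState.ANY,
        events=[NewEvent(type="Deposit", data=json.dumps(payload).encode())],
    )

    # 2. Persist a transaction record to the Mongo read model.
    db.transactions.insert_one(dict(payload))

    # 3. Update the account state read model.
    db.accounts.update_one(
        {"_id": account_id}, {"$inc": {"balance": amount}}, upsert=True
    )
```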
At the beginning of the project we need to bring in data from the legacy system, around 70M transactions, provided in time-series order.
We found that running these through our system (persisting events one at a time) was quite slow, primarily because we need the transactions to be applied in order (at least at the account level). So we are considering a couple of other options for importing the data and would like to know if you have any recommendations.
A couple of ideas:
- Persist events in batches (say 1000) to ESDB by pushing them to a single stream that then emits the events to the individual account streams. (Will this save much time?)
- Bulk write into MongoDB, which is fairly quick. As part of this, either:
- Persist a single event per account to ESDB representing the final account state, reducing events from 70M to ~1M
- Or persist all of the events, but don’t worry about emitting to update the read model during this initial import
Is there any better option we’re missing? We looked at replicator, but it doesn’t seem well suited to this purpose. Any advice is appreciated.
> - Persist events in batches (say 1000) to ESDB by pushing them to a single stream that then emits the events to the individual account streams. (Will this save much time?)
Probably not. I’ve done that for data that was received every six months.
The reason we did it back then was that some pre-processing needed to be done on the batch itself; the individual items could then be processed in parallel afterwards.
> - Bulk write into MongoDB which is fairly quick
> - Persist a single event per account to ESDB representing the final account state
You’ll lose the individual account transactions; that might be OK or not, depending on your requirements.
> we need these transactions to occur in order (at least at an account level).
> - Persist events in batches (say 1000) to ESDB by pushing them to a single stream that then emits …
Third option: in the import, reorder per account and do a batch append to the account streams, not an import stream.
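Roughly something like this, sketched in Python with the esdbclient package (field names and the grouping helper are illustrative; adapt to whatever client you use):

```python
# Sketch: group rows per account, then one batch append per account stream.
# Assumes the esdbclient package; field names are illustrative.
import json
from collections import defaultdict

from esdbclient import EventStoreDBClient, NewEvent, StreamState

client = EventStoreDBClient(uri="esdb://localhost:2113?Tls=false")

def import_chunk(rows):
    """rows: an iterable of transaction dicts already in time order."""
    per_account = defaultdict(list)
    for row in rows:
        # Appending in read order preserves time order within each account.
        per_account[row["accountId"]].append(row)

    for account_id, txs in per_account.items():
        events = [NewEvent(type=tx["type"], data=json.dumps(tx).encode()) for tx in txs]
        # One append call per account stream, many events per call.
        # (Chunk the list if an account's batch would exceed the max append size.)
        client.append_to_stream(
            stream_name=f"account-{account_id}",
            current_version=StreamState.ANY,
            events=events,
        )
```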
And yeah, for building the initial read model in Mongo, bulk write is the best IMHO.
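For the Mongo side, a bulk write per batch could look like this (just a sketch; the collection names and document shapes are assumptions):

```python
# Sketch: build the initial Mongo read model with bulk writes.
# Collection names and document shapes are assumptions.
from collections import defaultdict

from pymongo import InsertOne, MongoClient, UpdateOne

db = MongoClient("mongodb://localhost:27017")["payments"]

def bulk_load(rows):
    """rows: a batch of transaction dicts, e.g. one preprocessed CSV chunk."""
    tx_ops = [InsertOne(dict(row)) for row in rows]

    # Pre-aggregate per account so each account gets a single update in the batch.
    deltas = defaultdict(int)
    for row in rows:
        deltas[row["accountId"]] += row["amount"]
    account_ops = [
        UpdateOne({"_id": acc}, {"$inc": {"balance": delta}}, upsert=True)
        for acc, delta in deltas.items()
    ]

    # ordered=False lets the server apply the operations without waiting on order.
    db.transactions.bulk_write(tx_ops, ordered=False)
    db.accounts.bulk_write(account_ops, ordered=False)
```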
What is your source data store?
Unfortunately, our source data is going to be provided via a ton of CSVs, since we do not have access to the third party’s infrastructure.
I will see if there is some way we can group some of the CSVs and batch-import them, but it’s unlikely we can get all of that into memory: most accounts will have around 0–2 transactions per day, and we’ll likely have hundreds of these CSVs representing daily transactions.
Thank you for the insights!
If I understand correctly, you:
- have some (large) initial import to do,
- then get daily imports,
- in the form of CSVs,
- containing daily transactions,
- lots of transactions, though at most a handful per account per CSV,
- meaning lots of accounts.
Quick math
5 transactions per account × 350 days = 1,750 per account per year.
You don’t need to load the full CSVs into memory.
You could preprocess them like this:
- read each original CSV line by line (well, multiple CSVs at once),
- and copy each row over into an account-specific, per-year CSV.
That should be quite fast.
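A rough sketch of that preprocessing in Python, assuming the daily CSVs carry an account id and a date column (the column names here are guesses):

```python
# Sketch: split daily CSVs into account-specific, per-year CSVs without
# holding everything in memory. Column names ("accountId", "date") are guesses.
import csv
import glob
from pathlib import Path

OUT_DIR = Path("by_account")
OUT_DIR.mkdir(exist_ok=True)

def split_daily_csvs(pattern="daily/*.csv"):
    for path in sorted(glob.glob(pattern)):  # sorted keeps the time order
        with open(path, newline="") as src:
            reader = csv.DictReader(src)
            for row in reader:
                year = row["date"][:4]  # e.g. "2017-03-02" -> "2017"
                out_path = OUT_DIR / f"{row['accountId']}-{year}.csv"
                write_header = not out_path.exists()
                # For real throughput, keep a small cache of open writers
                # instead of reopening the output file per row.
                with open(out_path, "a", newline="") as dst:
                    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
                    if write_header:
                        writer.writeheader()
                    writer.writerow(row)
```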
Then you can import those into ESDB in a batched fashion.
Assuming millions of accounts, I’d do this work in a pipelined fashion: process the original CSVs, and once there is enough output, begin importing while continuing to transform the remaining CSVs.
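A minimal way to wire that pipeline, sketched with a bounded queue and one producer thread; `append_batch` stands in for whatever batched ESDB append function you end up with:

```python
# Sketch: overlap the CSV transformation and the ESDB import using a bounded
# queue; `append_batch` is whatever batched append function you plug in.
import csv
import glob
import queue
import threading

batches: queue.Queue = queue.Queue(maxsize=10)  # bounded: importer applies backpressure

def transform(pattern="by_account/*.csv", batch_size=1000):
    """Producer: read the preprocessed per-account CSVs and emit batches of rows."""
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))  # per-account files are small
        for i in range(0, len(rows), batch_size):
            batches.put(rows[i:i + batch_size])
    batches.put(None)  # sentinel: nothing left to transform

def run(append_batch):
    producer = threading.Thread(target=transform)
    producer.start()
    while (batch := batches.get()) is not None:  # consumer: import as batches arrive
        append_batch(batch)
    producer.join()
```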
This is very similar to what I proposed to do, so hopefully this works out. Thank you for the help!