Hey ESDB team,
We’re working on a project to migrate an old payment system to a new in-house system, and we’re using ESDB for it. The pipeline is roughly:
- Persist the event to ESDB (e.g. Deposit)
- Persist a transaction record to MongoDB (read model)
- Update the account state in MongoDB (read model)
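For context, each transaction currently goes through roughly this shape of code. This is a simplified Python sketch (using the esdbclient and pymongo packages); the stream names, event types, and document shapes are illustrative, not our actual schema:

```python
# Simplified sketch of the current per-transaction pipeline.
# Assumes the esdbclient and pymongo packages; names and schemas are illustrative.
import json

from esdbclient import EventStoreDBClient, NewEvent, StreamState
from pymongo import MongoClient

esdb = EventStoreDBClient(uri="esdb://localhost:2113?Tls=false")
db = MongoClient("mongodb://localhost:27017")["payments"]

def handle_deposit(account_id: str, amount: int, tx_id: str) -> None:
    payload = {"accountId": account_id, "amount": amount, "txId": tx_id}

    # 1. Persist the event to ESDB (one stream per account).
    esdb.append_to_stream(
        stream_name=f"account-{account_id}",
        current_version=StreamState.ANY,
        events=[NewEvent(type="Deposit", data=json.dumps(payload).encode())],
    )

    # 2. Persist a transaction record to the Mongo read model.
    db.transactions.insert_one(dict(payload))

    # 3. Update the account state read model.
    db.accounts.update_one(
        {"_id": account_id}, {"$inc": {"balance": amount}}, upsert=True
    )
```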
At the beginning of the project we need to bring in data from the legacy system, around 70M transactions, provided in time-series order.
We found that running these through our system (persisting events one at a time) was quite slow, primarily because we need the transactions to be applied in order (at least at the account level). So we are considering a couple of other options for importing the data and would like to know if you have any recommendations.
A couple of ideas:
- Persist events in batches (say 1000) to ESDB by pushing them to a single stream that then emits the events to the individual account streams. (Will this save much time?)
- Bulk write into MongoDB, which is fairly quick. As part of this, either:
- Persist a single event per account to ESDB representing the final account state, reducing events from 70M to ~1M
- Or persist all of the events, but don’t worry about emitting to update the read model during this initial import
Is there any better option we’re missing? We looked at replicator, but it doesn’t seem well suited to this purpose. Any advice is appreciated.
> - Persist events in batches (say 1000) to ESDB by pushing them to a single stream that then emits the events to the individual account streams. (Will this save much time?)
Probably not. I’ve done that for data that was received every six months.
The reason we did it back then was that some pre-processing needed to be done on the batch itself; the individual items could then be processed in parallel afterwards.
> - Bulk write into MongoDB which is fairly quick
> - Persist a single event per account to ESDB representing the final account state
You’ll lose the individual account transactions; that might be OK or not, depending on your requirements.
> we need these transactions to occur in order (at least at an account level).
> - Persist events in batches (say 1000) to ESDB by pushing them to a single stream that then emits …
Third option: in the import, reorder per account and do a batch append to the account streams, not an import stream.
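Roughly something like this, sketched in Python with the esdbclient package (field names and the grouping helper are illustrative; adapt to whatever client you use):

```python
# Sketch: group rows per account, then one batch append per account stream.
# Assumes the esdbclient package; field names are illustrative.
import json
from collections import defaultdict

from esdbclient import EventStoreDBClient, NewEvent, StreamState

client = EventStoreDBClient(uri="esdb://localhost:2113?Tls=false")

def import_chunk(rows):
    """rows: an iterable of transaction dicts already in time order."""
    per_account = defaultdict(list)
    for row in rows:
        # Appending in read order preserves time order within each account.
        per_account[row["accountId"]].append(row)

    for account_id, txs in per_account.items():
        events = [NewEvent(type=tx["type"], data=json.dumps(tx).encode()) for tx in txs]
        # One append call per account stream, many events per call.
        # (Chunk the list if an account's batch would exceed the max append size.)
        client.append_to_stream(
            stream_name=f"account-{account_id}",
            current_version=StreamState.ANY,
            events=events,
        )
```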
And yeah, for building the initial read model in Mongo, bulk write is the best IMHO.
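For the Mongo side, a bulk write per batch could look like this (just a sketch; the collection names and document shapes are assumptions):

```python
# Sketch: build the initial Mongo read model with bulk writes.
# Collection names and document shapes are assumptions.
from collections import defaultdict

from pymongo import InsertOne, MongoClient, UpdateOne

db = MongoClient("mongodb://localhost:27017")["payments"]

def bulk_load(rows):
    """rows: a batch of transaction dicts, e.g. one preprocessed CSV chunk."""
    tx_ops = [InsertOne(dict(row)) for row in rows]

    # Pre-aggregate per account so each account gets a single update in the batch.
    deltas = defaultdict(int)
    for row in rows:
        deltas[row["accountId"]] += row["amount"]
    account_ops = [
        UpdateOne({"_id": acc}, {"$inc": {"balance": delta}}, upsert=True)
        for acc, delta in deltas.items()
    ]

    # ordered=False lets the server apply the operations without waiting on order.
    db.transactions.bulk_write(tx_ops, ordered=False)
    db.accounts.bulk_write(account_ops, ordered=False)
```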
What is your source data store?
Unfortunately, our source data is going to be provided via a ton of CSVs, since we do not have access to the third party’s infrastructure.
I will see if there is some way we can group some of the CSVs and batch-import them, but it’s unlikely we can get all of that into memory: most accounts will have around 0–2 transactions per day, and we’ll likely have hundreds of these CSVs representing daily transactions.
Thank you for the insights!
If I understand correctly, you:
- have some (large) initial import to do,
- then get daily imports,
- in the form of CSVs,
- containing daily transactions,
- lots of transactions, though at most a handful per account per CSV,
- meaning lots of accounts.
Quick math
5 transactions per account × 350 days = 1,750 per account per year.
You don’t need to load the full CSVs into memory.
You could preprocess them like this:
- read each original CSV line by line (well, multiple CSVs at once),
- and copy each row over into an account-specific, per-year CSV.
That should be quite fast.
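A rough sketch of that preprocessing in Python, assuming the daily CSVs carry an account id and a date column (the column names here are guesses):

```python
# Sketch: split daily CSVs into account-specific, per-year CSVs without
# holding everything in memory. Column names ("accountId", "date") are guesses.
import csv
import glob
from pathlib import Path

OUT_DIR = Path("by_account")
OUT_DIR.mkdir(exist_ok=True)

def split_daily_csvs(pattern="daily/*.csv"):
    for path in sorted(glob.glob(pattern)):  # sorted keeps the time order
        with open(path, newline="") as src:
            reader = csv.DictReader(src)
            for row in reader:
                year = row["date"][:4]  # e.g. "2017-03-02" -> "2017"
                out_path = OUT_DIR / f"{row['accountId']}-{year}.csv"
                write_header = not out_path.exists()
                # For real throughput, keep a small cache of open writers
                # instead of reopening the output file per row.
                with open(out_path, "a", newline="") as dst:
                    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
                    if write_header:
                        writer.writeheader()
                    writer.writerow(row)
```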
Then you can import those into ESDB in a batched fashion.
Assuming millions of accounts, I’d do this work in a pipelined fashion: process the original CSVs, and once there is enough output, begin importing while continuing to transform the remaining CSVs.
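A minimal way to wire that pipeline, sketched with a bounded queue and one producer thread; `append_batch` stands in for whatever batched ESDB append function you end up with:

```python
# Sketch: overlap the CSV transformation and the ESDB import using a bounded
# queue; `append_batch` is whatever batched append function you plug in.
import csv
import glob
import queue
import threading

batches: queue.Queue = queue.Queue(maxsize=10)  # bounded: importer applies backpressure

def transform(pattern="by_account/*.csv", batch_size=1000):
    """Producer: read the preprocessed per-account CSVs and emit batches of rows."""
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))  # per-account files are small
        for i in range(0, len(rows), batch_size):
            batches.put(rows[i:i + batch_size])
    batches.put(None)  # sentinel: nothing left to transform

def run(append_batch):
    producer = threading.Thread(target=transform)
    producer.start()
    while (batch := batches.get()) is not None:  # consumer: import as batches arrive
        append_batch(batch)
    producer.join()
```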
This is very similar to what I proposed to do, so hopefully this works out. Thank you for the help!