Scrambling Data in EventStoreDB – Best Practices and Implementation Strategies

I’m currently working on a solution where I need to scramble data stored in EventStoreDB. I understand that one approach is to publish new events reflecting updates to an aggregate’s current state. However, I would also like to explore ways to effectively scramble the data in the original events themselves.

Here are a few specific questions I have:

  1. Scrambling Existing Events:
  • What is the best practice for scrambling or obfuscating data in existing events that have already been stored in EventStoreDB? Is it possible to modify past events while maintaining the integrity of the event log, or should I create new events to represent the changes and ignore or delete the original ones?
  2. Data Compliance and Auditing:
  • In scenarios where scrambling is required for compliance (e.g., GDPR, CCPA), how do you ensure that the historical audit trail remains intact while also adhering to the data scrambling or anonymization requirements?
  3. Reprocessing vs. New Events:
  • Is there a preferred method for handling scrambling at the event level, such as reprocessing old events or emitting new “scramble” events? How does this approach align with event sourcing principles?
  4. Performance Implications:
  • Are there any known performance impacts when dealing with large datasets and applying a data scrambling mechanism? If so, how can they be mitigated?
  5. Scrambling Sensitive Data:
  • For sensitive data fields in the events (e.g., PII), what encryption or scrambling techniques would you recommend that work well with EventStoreDB’s architecture?
  6. Tools & Libraries:
  • Are there any existing tools or libraries specifically designed for scrambling or anonymizing event data in EventStoreDB that you could recommend?

I’m looking for a solution that balances compliance with data privacy laws and best practices within event-sourced systems. Any insights, strategies, or experiences from the community would be greatly appreciated!

Thanks in advance!

Obligatory note: I’m not a jurist or a specialist in these matters. Check with your legal team, security team, compliance team, CISO, and other relevant teams or people to understand how the regulations apply to your business, processes & applications.

First, some general comments & questions:

  • Where does the question of encrypting data come from?

  • What are the requirements around compliance & auditing? What needs have your security team & CISO expressed?

  • What are the exact legal obligations? Does your field of operations have stricter rules? Does your company already have certifications or compliance programs in place?

    • Some might overlap with GDPR / CCPA or give you a head start.
  • The general guidance is the same as for any data store or application:

    • If you can, use encryption at rest at the OS level.
    • Model the data in such a way as to separate PII / sensitive data from the rest (see the sketch after this list).
    • Make sure most of your processing can do its work without the PII / sensitive data.
    • Having a solid archiving capability helps, because it is essentially the same kind of processing.
      • Archiving as in removing unused data from the operational database (not “tiered storage”, not “backup”).
  • On GDPR (I don’t know about CCPA):

    • Keep in mind it’s not only about forgetting data.
      • It’s also about knowing where the data will be processed, knowing what data exists, and having policies around how long it’s kept.
    • Keep in mind that scrambling or encrypting data is not enough, nor the only way (e.g., separating PII from the rest).
      • You need to be able to detect breaches and unauthorized access (at the app level as well as the infrastructure level).
      • Downstream consumers (think BI / data analytics) also need to comply with GDPR requests.
    • Sometimes you have overriding rules (I’ve been in places where we were legally bound to keep records for many decades; no GDPR “forget me” request is allowed on that data).
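
As an illustration of the “separate PII from the rest” point above, here is a minimal sketch. You didn’t mention your stack, so I’m using the Node gRPC client (@eventstore/db-client) as an example; the stream names and event shapes are made up for illustration:

```typescript
// Sketch: keep PII out of the business streams and reference it by subject id.
// Stream names and event shapes are illustrative only.
import { EventStoreDBClient, jsonEvent } from "@eventstore/db-client";

const client = EventStoreDBClient.connectionString`esdb://localhost:2113?tls=false`;

async function placeOrder(orderId: string, subjectId: string, email: string) {
  // PII lives in its own per-subject stream, easy to delete wholesale later.
  await client.appendToStream(
    `pii-${subjectId}`,
    jsonEvent({ type: "ContactDetailsProvided", data: { email } })
  );

  // The business stream only references the subject, never the PII itself.
  await client.appendToStream(
    `order-${orderId}`,
    jsonEvent({ type: "OrderPlaced", data: { orderId, subjectId } })
  );
}

// A "forget me" request then only touches the pii-* stream; the rest of the
// history stays intact. (Soft delete here; a scavenge reclaims the disk space.)
async function forget(subjectId: string) {
  await client.deleteStream(`pii-${subjectId}`);
}
```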
  1. Scrambling Existing Events:

The data is immutable; you cannot change it.
You could rewrite your streams (with a marker event at the end, before rewriting) and then truncate those streams, as sketched below.
There are other ways; it depends on what your streams & app design look like.
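
A rough sketch of that rewrite-and-truncate approach, again with the Node client. I’m assuming setStreamMetadata accepts truncateBefore (as the gRPC clients generally do); the scrub() transform, event ids and metadata handling are left out for brevity:

```typescript
// Rewrite a stream with scrubbed payloads, then truncate the originals.
// Sketch only: event metadata, ids and error handling are omitted.
import { EventStoreDBClient, jsonEvent } from "@eventstore/db-client";

const client = EventStoreDBClient.connectionString`esdb://localhost:2113?tls=false`;

async function rewriteScrubbed(stream: string, scrub: (data: any) => any) {
  // 1. Collect the existing events and scrub their payloads.
  const scrubbed: { type: string; data: any }[] = [];
  for await (const { event } of client.readStream(stream)) {
    if (event) scrubbed.push({ type: event.type, data: scrub(event.data) });
  }

  // 2. Append a marker so readers can tell where the rewritten history begins.
  const marker = await client.appendToStream(
    stream,
    jsonEvent({ type: "StreamRewritten", data: { reason: "pii-scrubbed" } })
  );

  // 3. Re-append the scrubbed copies after the marker.
  await client.appendToStream(stream, scrubbed.map((e) => jsonEvent(e)));

  // 4. Truncate everything before the marker; a scavenge then reclaims
  //    the original events from disk.
  await client.setStreamMetadata(stream, {
    truncateBefore: marker.nextExpectedRevision,
  });
}
```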

  2. Data Compliance and Auditing:
    “In scenarios where scrambling is required for compliance…”

Check with your legal team whether encryption of the data is actually written down in the regulations.
AFAIK, GDPR does not require the data to be encrypted.

As for EventStoreDB: the data in events is just bytes, and if you scramble / encrypt it, it is still just bytes. The only issue is with user projections: ESDB will not decrypt the data, so projections that expect JSON will not work on it.
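
To make that concrete: you can append ciphertext as a binary event and ESDB will store it happily; it just won’t be usable by anything that expects JSON, user projections included. A sketch with the Node client and Node’s built-in crypto (key management deliberately out of scope):

```typescript
// ESDB stores event data as opaque bytes, so ciphertext appends fine.
// AES-256-GCM via Node's built-in crypto; key management is out of scope here.
import { EventStoreDBClient, binaryEvent } from "@eventstore/db-client";
import { createCipheriv, randomBytes } from "node:crypto";

const client = EventStoreDBClient.connectionString`esdb://localhost:2113?tls=false`;

async function appendEncrypted(
  stream: string,
  type: string,
  payload: object,
  key: Buffer // 32 bytes
) {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(payload), "utf8"),
    cipher.final(),
  ]);
  // Pack iv + auth tag + ciphertext into one blob; ESDB neither knows nor cares.
  const data = Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
  await client.appendToStream(stream, binaryEvent({ type, data }));
}
```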

  3. Reprocessing vs. New Events:

See the response to 1.

“How does this approach align with event sourcing principles?”

In event sourcing we only ever add information, and eventually forget old information.
So it aligns.

  4. Performance Implications:

If you use encryption there will be a performance penalty (mostly CPU bound) and possibly an increase in size on disk as well (depending on the algorithm used).
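
If you want a rough feel for that cost on your own hardware before committing, timing the cipher over a representative payload is cheap to do. A throwaway sketch (numbers will vary by machine and payload size; note that AES-GCM also adds a nonce and auth tag per event, and base64-encoding ciphertext into JSON inflates it by roughly a third):

```typescript
// Throwaway timing check: AES-256-GCM over a ~1 KiB payload.
// The per-iteration randomBytes(12) nonce generation is included in the cost.
import { createCipheriv, randomBytes } from "node:crypto";

const key = randomBytes(32);
const payload = Buffer.alloc(1024, "x"); // adjust to your typical event size

const iterations = 100_000;
const start = process.hrtime.bigint();
for (let i = 0; i < iterations; i++) {
  const cipher = createCipheriv("aes-256-gcm", key, randomBytes(12));
  cipher.update(payload);
  cipher.final();
  cipher.getAuthTag();
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`${iterations} encryptions in ${elapsedMs.toFixed(0)} ms`);
```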

  5. Scrambling Sensitive Data:

I cannot recommend any specific encryption mechanism; that is something to check with your security team.

For ESDB: the event data is just bytes, so encrypted or not does not matter (except for custom user projections that expect JSON).
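
One pattern worth knowing here, since it plays well with immutable events, is crypto-shredding: encrypt each data subject’s PII with its own key, and “forget” them by destroying the key, leaving the events untouched but permanently unreadable. A sketch (the in-memory key store is purely illustrative; in practice you’d use a KMS or vault):

```typescript
// Crypto-shredding sketch: one key per data subject; deleting the key renders
// that subject's ciphertext unreadable without touching the immutable events.
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const keyStore = new Map<string, Buffer>(); // subjectId -> key (illustrative)

function keyFor(subjectId: string): Buffer {
  let key = keyStore.get(subjectId);
  if (!key) {
    key = randomBytes(32);
    keyStore.set(subjectId, key);
  }
  return key;
}

function encryptPii(subjectId: string, plaintext: string): Buffer {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", keyFor(subjectId), iv);
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ct]); // iv + tag + ciphertext
}

function decryptPii(subjectId: string, blob: Buffer): string | null {
  const key = keyStore.get(subjectId);
  if (!key) return null; // key shredded: the data is effectively forgotten
  const decipher = createDecipheriv("aes-256-gcm", key, blob.subarray(0, 12));
  decipher.setAuthTag(blob.subarray(12, 28));
  return Buffer.concat([
    decipher.update(blob.subarray(28)),
    decipher.final(),
  ]).toString("utf8");
}

function forget(subjectId: string) {
  keyStore.delete(subjectId); // the "scramble": ciphertext stays, key is gone
}
```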

  6. Tools & Libraries:

Rule of thumb: stick to the tools and libraries provided by the stack you use, or to well-known, supported and maintained libraries.
(You didn’t mention the stack you use.)

Note:
ESDB will soon have an encryption-at-rest feature (probably a paid feature), meaning the bytes of the data files will be encrypted & decrypted by ESDB itself.

These two resources might help as well: