Scavenge not reducing diskspace

ken.faulkner · December 6, 2020, 7:53pm

Hi

I feel I’m going over already discussed ground (but can’t find any specific threads), but I’m performing a scavenge on a cluster where a LOT of streams (containing plenty of data) have been deleted. I perform a scavenge on one of the nodes (just going through one at a time) and although it completes successfully (over 24 hours later)… the amount of drive space freed up is minuscule.

Back history. Imported 69M events from another source into ES. This caused the ES disk space to jump from about 12G to well over 100G. It has since been decided that we don’t have to have those new events in ES (and it was causing issues for us), so each of the new streams (all new data went into new streams) was soft deleted. I had 2 expectations… 1) on the next scavenge ES would free up space used by those streams BEFORE the soft delete mark… 2) I’d get back down to about the 12G level.

The scavenge has reduced the db directory down to about 90G… so technically lower, but not by much. And certainly nothing near the 12G from before the import.

One thing I should note, although the streams were deleted (soft delete) with the mass migrated data, new/fresh data has since been added to those streams. So an individual stream might have tens/hundreds of thousands of events BEFORE the soft delete mark… but only a few thousands after. Unsure if that would make a difference, but wanted to note it here.

Is there any obvious step that I’ve missed? Scavenging would free up disk space fine prior to this large (for us) migration of data.

Any help highly appreciated.

Thanks

Ken

hayley.campbell · December 7, 2020, 7:32am

Hi Ken,

Do you have projections or anything that may have created linkTo events from the imported data?

If you do, then you may have run into the issue described here: https://github.com/EventStore/EventStore/issues/2626. In short, linkTo streams aren’t scavenged away or deleted when the original stream is.

If this is the case, you may be able to recover more space by manually deleting any linkTo streams created from the imported ones.

ken.faulkner · December 7, 2020, 7:52am

Hi

I purely have the standard 5 internal projections running, which I believe create linkTo streams. Reading that github issue I’m really unsure if we’re experiencing that or not. Since yes, we make heavy use of the $ce- linked streams in particular… Is that GH issue saying that say the millions of events that are linked in the $ce- streams won’t get deleted? Or was it just the last event of each stream? If it’s just the last event then that can’t be contributing to the 70+ G that is being used up.

Can you clarify that GH issue?

Thanks

Ken

hayley.campbell · December 8, 2020, 10:02am

The github issue is saying that the linkTo events created for the millions of events that were linked in the $ce- stream have not been deleted.

LinkTo events are still events that take space in your database, and each projection will potentially write an event to its relevant stream.
These linkTo events are not removed when the underlying stream is removed, they simply become unresolvable.
You would be able to see these events if you were to browse a $ce- stream for some of the data that was removed after importing.

There is no mechanism to remove these linkTo streams automatically, but if you can manually delete these streams you should be able to recover some more space.

ken.faulkner · December 8, 2020, 7:18pm

Hi

Hmmm well, when I look at the $ce- stream I definitely don’t see any of the old deleted events. Do I need to manipulate the URL to see them?

Thanks

Ken

hayley.campbell · December 9, 2020, 8:17am

The easiest way to check this would be to:

1. Identify a category stream that would have data that was removed after the import. For example, if some of the imported events were written to a stream invoice-{invoice_number}, then the corresponding category stream would be $ce-invoice.
2. Read through the $ce-invoice stream from beginning to end with ResolveLinkTos enabled.
3. Check whether the events returned by the reads are resolved. If you’re using the TCP client, you can check the IsResolved property on each event.

It’s also possible that you have linkTo events that were created by the other projections such as $et-{event_type} streams taking up a lot of space. Checking these would be the same process as above.
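
As a rough illustration of the check described above, here is a minimal Python sketch that tallies resolved and unresolved links once the reads have been done. It does not talk to Event Store: `ReadResult` is a hypothetical stand-in for whatever your client returns for each event, and the field names are assumptions made for this example. The byte count only covers the link payload (the `eventNumber@streamName` string), so treat it as a lower bound rather than the space a scavenge would actually free.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class ReadResult:
    """Hypothetical shape of one item read from a link stream with link resolution on."""
    link_data: str            # payload of the link event, e.g. "0@invoice-123"
    resolved: Optional[dict]  # the target event, or None if it could not be resolved


def summarize_links(results: Iterable[ReadResult]) -> Dict[str, int]:
    """Count resolved and unresolved links and total the unresolved link payload bytes."""
    resolved = 0
    unresolved = 0
    unresolved_link_bytes = 0
    for result in results:
        if result.resolved is None:
            unresolved += 1
            unresolved_link_bytes += len(result.link_data.encode("utf-8"))
        else:
            resolved += 1
    return {
        "resolved": resolved,
        "unresolved": unresolved,
        "unresolved_link_bytes": unresolved_link_bytes,
    }
```

A large unresolved count in a $ce- or $et- stream suggests that stream is holding links to events that no longer exist.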

ken.faulkner · December 11, 2020, 1:26am

Hi

Ahhh definitely seeing it now. Even if the original event is deleted, even just totalling the “data” and “metadata” fields is certainly adding up. Currently have a program calculating how much space we should get back… think it will take a while to complete.

Is there any special way to delete these linked entries? I’ve only ever performed soft deletes in the “real” streams and never with these linked streams.

Any recommendations/thoughts?

Thanks

Ken

hayley.campbell · December 11, 2020, 8:47am

Hi,

Deleting linked streams is the same as deleting normal streams. You can simply soft delete them and run a scavenge to remove them.

The only thing to be aware of is the case where several original streams feed the same link stream (for example, every invoice-* stream feeds $ce-invoice) and not all of those original streams were deleted. The link stream will then contain some events that still resolve and some that don’t, and deleting it will remove the link events that still resolve along with the unresolved ones.

ken.faulkner · December 14, 2020, 3:07am

Hi Hayley,

Hmmm given that these streams have been reused since the initial deletion, I’m wondering if that’s going to cause a problem. Situation is orig streams soft deleted, then populated with new/valid data. To me (after a bit of testing) it means that I can’t actually delete/free up $ce-* due to it being populated with new linked events. I was picturing that the delete API call would allow “expected event” to be anywhere in the timeline and simply anything before that would be considered soft delete. But after my tests I’m assuming that’s completely wrong.

How did you manage to do it? Did you just make sure you cleaned up the $ce-* before adding new data (if so, too late for me)… or did you manually add the $tb somehow?

Thanks

Ken

hayley.campbell · December 21, 2020, 10:51am

You can set the TruncateBefore ($tb) values manually. Please see the docs here: https://developers.eventstore.com/clients/dotnet/5.0/streams/stream-metadata.html#writing-metadata

Set the TruncateBefore to the event number of the first event you want to keep, and then all of the events before that will be eligible to be scavenged.
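
To make that concrete, here is a small Python sketch of the two decisions involved: picking the event number to keep from, and building the metadata JSON with `$tb` set. It is only a sketch under assumptions. The helper names are made up for this example, the input is assumed to be (event number in the link stream, did the link resolve) pairs you collected yourself, and actually writing the metadata is left to your client as per the docs linked above. Existing metadata is merged in on the assumption that a metadata write replaces the stream’s metadata as a whole.

```python
import json
from typing import Iterable, Optional, Tuple


def first_event_to_keep(entries: Iterable[Tuple[int, bool]]) -> Optional[int]:
    """Return the link stream event number of the first link that still resolves.

    entries must be in stream order. Returns None if nothing resolves.
    """
    for event_number, is_resolved in entries:
        if is_resolved:
            return event_number
    return None


def with_truncate_before(existing_metadata: dict, truncate_before: int) -> str:
    """Return stream metadata JSON with $tb set, preserving any other keys."""
    if truncate_before < 0:
        raise ValueError("truncate_before must be a non-negative event number")
    merged = dict(existing_metadata)  # copy so the caller's dict is untouched
    merged["$tb"] = truncate_before
    return json.dumps(merged, sort_keys=True)
```

Choosing the first link that still resolves is the cautious option: it only drops the unresolved links at the start of the stream and leaves any later unresolved ones in place.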

ken.faulkner · December 21, 2020, 7:35pm

Ahh so it was a manual process. Cool… will try that out on a test cluster today.

Fingers crossed

Thanks again!

Ken