Tags: git, git-filter-branch, bfg-repo-cleaner, svn2git

How to delete old/historic Git objects which are no longer in HEAD?


I have a Git repository with nearly 300k commits spanning more than two decades. This repository is the result of a migration from SVN to Git, so I'm free to rewrite history. It has only a single branch and no tags.

Very often, files were added in commits and then removed later on, leaving objects in the repository which I don't want to keep.

How can I unconditionally delete all objects that are no longer in HEAD?

Things I've tried:

  • Neither BFG nor git filter-branch seems to cover this use case (without me writing a script that retrieves all the old object IDs).
  • The Git repository is being migrated (repeatedly in CI, until I'm satisfied with the result) with KDE's svn2git, but its rules do not seem to cover this use case either.

I'm not sure if/how git-filter-repo could do this.

EDIT:

For clarification: My ultimate goal is to reduce the repository size while still keeping a maximum amount of history. Over the years, many files (binary files too - small and large ones) were committed and deleted later on. A large repository leads to slow clones, a slow Git CLI, and various issues in other tools (e.g. CI).


Solution

  • Personally, I would use the BFG for this sort of task (I am the author of the BFG, so maybe that's not surprising!), though it behaves in a way that is perhaps not ideal given what you're asking for. You're saying "delete all objects that are no longer in HEAD", and technically speaking, this command will do that:

    bfg --delete-files "*"
    

    This says "delete any file", but because the BFG protects your HEAD commit, that commit will stay unchanged, with all its files intact.

    However, perhaps undesirably, the implementation of this particular --delete-files parameter will remove those files from earlier commits, so pretty much all prior commits will have all their files wiped. You can see that by looking at this HEAD commit: all files are preserved from the original commit, but they appear to have been introduced all at once by the latest commit:

    https://github.com/bfg-repo-cleaner-demos/rails-with-all-non-head-files-deleted/commit/b689725edf03b86c31dc3e8d589fd01c0435ec8c

    Another approach with the BFG is to use the --strip-blobs-with-ids parameter, which is much more specific, but as you've noted, you would probably need to write a script to find out the blob ids for every single file that's not in the HEAD commit.
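
The script mentioned above can be fairly small. Here is a minimal sketch, assuming a reasonably recent Git and standard POSIX tools; `list_non_head_blobs` is a hypothetical helper name, not part of Git or the BFG. It lists every blob reachable from any ref, lists the blobs in the HEAD tree, and prints the difference:

```shell
#!/usr/bin/env bash
# Sketch only: print the IDs of all blobs that exist somewhere in history
# but are not part of the HEAD tree.
# list_non_head_blobs is a hypothetical helper name.
list_non_head_blobs() {
  local repo="${1:-.}"
  local all_blobs head_blobs
  all_blobs=$(mktemp)
  head_blobs=$(mktemp)

  # Every object reachable from any ref, reduced to blob IDs.
  # rev-list prints "<id> <path>", so keep only the ID before asking
  # cat-file for the object type.
  git -C "$repo" rev-list --objects --all \
    | cut -d' ' -f1 \
    | git -C "$repo" cat-file --batch-check='%(objecttype) %(objectname)' \
    | awk '$1 == "blob" { print $2 }' \
    | sort -u > "$all_blobs"

  # Blob IDs in the HEAD tree (ls-tree prints "<mode> <type> <id>\t<path>").
  git -C "$repo" ls-tree -r HEAD \
    | awk '$2 == "blob" { print $3 }' \
    | sort -u > "$head_blobs"

  # Lines that appear only in the first file: blobs not in HEAD.
  comm -23 "$all_blobs" "$head_blobs"
  rm -f "$all_blobs" "$head_blobs"
}
```

You would redirect the output to a file, for example `list_non_head_blobs my-repo.git > blob-ids.txt`, and pass that file to `--strip-blobs-with-ids`. The repo and file names here are placeholders. On a repo with hundreds of thousands of commits this may take a while, so try it on a test copy first.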

    I would encourage you to ask yourself what you want the payoff of this cleaning operation to be. I know you've said you want to unconditionally delete all objects that are no longer in HEAD, but what is the reward you want to reap from that? Here are a couple of possible rewards:

    • Reduced overall data size of the repo, making it faster to clone, and taking up less storage space in hosting and on developers' laptops
    • Removal of any possible sensitive data that may have been committed to the repo in the past - if all you've got are the very latest files from HEAD anywhere in history, there's no potential for unwanted credentials or personal data to be hiding away in there.

    If the first of those - reduction of data size - is your primary concern, then it might be worth checking how much additional benefit you get from deleting all files from history, compared to just deleting the big ones. Due to the data structures used by git, it's very good at handling small files living over many thousands of commits - they may well not end up taking substantial storage space - and it's only the big files that will cause serious bloat.
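
To get a feel for where the bloat is, you can list the largest blobs anywhere in history. This is a rough sketch using plain Git plumbing; `largest_blobs` is a hypothetical helper name, and the sizes shown are the uncompressed blob sizes reported by `git cat-file`, not the space they occupy inside a packfile:

```shell
#!/usr/bin/env bash
# Sketch only: print the N largest blobs reachable from any ref,
# one per line, as "<size-in-bytes> <blob-id> <path>".
# largest_blobs is a hypothetical helper name.
largest_blobs() {
  local repo="${1:-.}"
  local count="${2:-10}"

  # rev-list prints "<id> <path>"; %(rest) carries the path through cat-file.
  git -C "$repo" rev-list --objects --all \
    | git -C "$repo" cat-file \
        --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' \
    | sed -n 's/^blob //p' \
    | sort -k1,1nr \
    | head -n "$count"
}
```

For example, `largest_blobs my-repo.git 20` would show the twenty biggest blobs (the repo name is a placeholder). If a handful of files dominate the list, removing only those is likely to capture most of the possible savings.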

    So, I would suggest you run the bfg --delete-files "*" command above on a test copy of your repo, and check what the resulting reduction in size is - for example, maybe it's 90%. That would represent the maximum possible size savings. You could then perform a different run on a fresh copy of the test repo, where you try this command:

    bfg --strip-blobs-bigger-than 1M
    

    This deletes only the 'big' files - ones over 1MB in size. What's the resulting repo size? Maybe the size reduction is only 85% in this case, but you keep most of your history intact, which could conceivably be useful, and that extra 5% space saving might not be worth the effort of pursuing further.
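
One way to compare the two runs is to measure each test copy before and after cleaning. The helpers below are a sketch, not part of the BFG: `repo_size_kib` and `reduction_percent` are hypothetical names, and the size is the sum of the loose-object and packfile sizes reported by `git count-objects -v` (in KiB).

```shell
#!/usr/bin/env bash
# Sketch only: helpers for comparing repository size before and after a cleanup.
# repo_size_kib and reduction_percent are hypothetical helper names.

# Sum of loose-object size and packfile size, in KiB.
repo_size_kib() {
  git -C "${1:-.}" count-objects -v \
    | awk '$1 == "size:" || $1 == "size-pack:" { total += $2 }
           END { print total + 0 }'
}

# Integer percentage saved going from size $1 to size $2.
reduction_percent() {
  local before="$1" after="$2"
  if [ "$before" -eq 0 ]; then
    echo 0
  else
    echo $(( (before - after) * 100 / before ))
  fi
}

# Possible workflow on a throwaway mirror clone (paths are placeholders):
#   before=$(repo_size_kib test-copy.git)
#   bfg --strip-blobs-bigger-than 1M test-copy.git
#   git -C test-copy.git reflog expire --expire=now --all
#   git -C test-copy.git gc --prune=now --aggressive
#   after=$(repo_size_kib test-copy.git)
#   reduction_percent "$before" "$after"
```

The `reflog expire` and `gc` steps matter for the measurement: after a history rewrite the old objects stay on disk until they are pruned, so the size will not drop until those commands have run.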