
ALL commits are duplicated after BFG (history editing)


I didn't know about the clean way to remove big/sensitive files with BFG, and missed a critical step: git clone --mirror git://example.com/some-big-repo.git

As a result, when I tried to push to the remote, I got history conflicts, which I naively resolved with git pull origin master --allow-unrelated-histories: I merged a few conflicts, then pushed.

This duplicated the commits, sometimes 5-10 times.

Since I'm the only one working on this repo, I neither have the option of wiping it and restarting from a more reasonable copy, nor do I have to worry that the duplicated commits differ from one another: I'm sure they are identical.

Is there a brute-force command to erase all commits that are identical in all aspects but the hashes?


Solution

  • Is there a brute-force command to erase all commits that are identical in all aspects but the hashes?

    No. You can, however, toss out your merge commit, which is what ties the old history and new history together. That won't erase the old history, but you can just stop using it. Eventually, if your Git can't find it, it will fall away.

    What you will need to do is run git reset --hard HEAD~1 in your own repository (to discard the one merge commit; this assumes the merge is the tip of master, so that HEAD~1 names its first parent), then use git push -f to send everything to origin and have them move their master.


    Two different commit hash IDs are two different commits, and it's impossible to change anything about any commit. That's why The BFG (and Git's own git filter-branch) copy all the commits: they literally can't change the old ones. That's how you got two copies of everything.

    First, you made new copies and tossed the old ones in favor of the new ones. That's what The BFG does. (That's not quite what git filter-branch does: it doesn't toss out the old ones, it just shoves them aside and then makes you toss them out.)

    So far so good. But then you ran git fetch to pick up all the old commits, followed by git merge with the --allow-unrelated-histories option (git pull runs both for you), which tells Git: now smash together the old ones and the new ones, even though they have no relationship to each other.

    If your old and new commit histories were very simple we could draw them like this:

    A--B--...--H   <-- origin/master
    
    A'-B'-...--H'  <-- master
    

    (The uppercase letters stand in for commit hashes, and the prime marks, e.g., A' instead of A, indicate that these are copies-with-something-changed, which is why they have different hashes.) Presumably your histories—your commits—are more complex, but this representation is still sufficient: there's a single original end-point commit such as H, and a single new end-point commit H', involved.

    The merge you stuck at the end does this:

    A--B--...--H    <-- origin/master
                \
                 M   <-- master
                /
    A'-B'-...--H'
    

    (where the first parent of M is H' and the second parent of M is H). The name origin/master in your own Git is your own Git's memory of what origin's Git keeps saying, my master is <hash of H>: they are still remembering commit H as their master.
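
    To see the two parents of such a merge for yourself, here is a minimal sandbox sketch. Everything in it is an assumption made for illustration: the temporary directory, the branch name old (standing in for origin/master), and the files file.txt and big.bin. It runs only inside a throwaway repository, not your real one.

```shell
#!/usr/bin/env bash
# Sandbox only: everything happens in a fresh temporary directory.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
cd "$sandbox"
git init -q
git symbolic-ref HEAD refs/heads/old   # "old" plays the role of origin/master

# Original history (H): includes the big file.
echo hello > file.txt
echo payload > big.bin
git add file.txt big.bin
git commit -q -m "H: original commit"
old_tip=$(git rev-parse HEAD)

# Rewritten history (H'): same file.txt, no big file, no shared ancestor.
git checkout -q --orphan master
git rm -rf -q .
echo hello > file.txt
git add file.txt
git commit -q -m "H': rewritten commit"
new_tip=$(git rev-parse HEAD)

# The merge M that ties the two histories together.
git merge -q --allow-unrelated-histories -m "M: merge old into new" old

first_parent=$(git rev-parse HEAD^1)    # expected to be H'
second_parent=$(git rev-parse HEAD^2)   # expected to be H
echo "first parent:  $first_parent"
echo "second parent: $second_parent"
git log --oneline --graph
```

    In this sketch the merge also brings big.bin back into the working tree, which is why the merged branch drags the unwanted history along with it.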

    If you remove commit M from the tip of your own branch master, you're left with this in your own repository:

    A--B--...--H    <-- origin/master
                \
                 M   [abandoned]
                /
    A'-B'-...--H'   <-- master
    

    Commit M still exists but you can't see it any more: there is no easy way to find it. The not-easy ways to find it will keep it around for at least another 30 days in case you decide you want it back, but eventually, they will let it fall away and be truly gone.
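
    The "not-easy ways" are the reflogs. The following sandbox sketch (same hypothetical setup as any throwaway repository: the branch old and the files file.txt and big.bin are made up for illustration) discards a merge with git reset --hard HEAD~1 and then finds it again through master's reflog.

```shell
#!/usr/bin/env bash
# Sandbox only: everything happens in a fresh temporary directory.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
cd "$sandbox"
git init -q
git symbolic-ref HEAD refs/heads/old   # stands in for the old history

echo hello > file.txt
echo payload > big.bin
git add file.txt big.bin
git commit -q -m "H: original commit"

git checkout -q --orphan master        # unrelated, rewritten history
git rm -rf -q .
echo hello > file.txt
git add file.txt
git commit -q -m "H': rewritten commit"
new_tip=$(git rev-parse HEAD)

git merge -q --allow-unrelated-histories -m "M: merge" old
merge=$(git rev-parse HEAD)

# Drop M from master: HEAD~1 is M's first parent, H'.
git reset -q --hard HEAD~1
tip_after_reset=$(git rev-parse HEAD)

# M is no longer reachable from master, but the reflog still records it.
recovered=$(git rev-parse 'master@{1}')
git reflog show master
```

    If you change your mind before the reflog entry expires, git reset --hard with the recovered hash puts the merge back.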

    Now at this point you can run:

    git push --force origin master
    

    to have your Git call up origin's Git, make sure that they have all the rewritten commits (A'-...-H'), and then send them a forceful command of the form: Yes, this loses you access to commit H, but set your master to point to commit H' instead. They'll normally obey this command—if they won't, you must find out why they won't (e.g., GitHub's "protected branch" feature) and fix that first—and then they will have:

    A--B--...--H   [abandoned]
    
    A'-B'-...--H'  <-- master
    

    (assuming you never sent them M—if you did, they'll have it too, but likewise abandoned). Your Git will see that they obeyed this command and will update your origin/master to reflect it:

    A--B--...--H   [abandoned]
                \
                 M   [abandoned]
                /
    A'-B'-...--H'   <-- master, origin/master
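
    The forced push can be rehearsed safely before doing it for real. This is a minimal sketch under stated assumptions: a local bare repository origin.git stands in for the server, and the names work, rewritten, file.txt and big.bin are hypothetical.

```shell
#!/usr/bin/env bash
# Sandbox only: a local bare repository plays the role of the server.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare origin.git
git init -q work
cd work
git symbolic-ref HEAD refs/heads/master
git remote add origin "$sandbox/origin.git"

# Original history (H), published to the server.
echo hello > file.txt
echo payload > big.bin
git add file.txt big.bin
git commit -q -m "H: original commit"
git push -q origin master

# Rewritten, unrelated history (H') becomes the local master.
git checkout -q --orphan rewritten
git rm -rf -q .
echo hello > file.txt
git add file.txt
git commit -q -m "H': rewritten commit"
git checkout -q -B master
new_tip=$(git rev-parse HEAD)

# A plain push is refused because H' does not descend from H.
if git push -q origin master 2>/dev/null; then
  plain_push=accepted
else
  plain_push=rejected
fi

# The forced push makes the server move its master to H'.
git push -q --force origin master
server_tip=$(git --git-dir="$sandbox/origin.git" rev-parse refs/heads/master)
tracking_tip=$(git rev-parse origin/master)
echo "plain push: $plain_push, server master: $server_tip"
```

    A real hosting service may add restrictions that this local bare repository does not have, such as the protected branches mentioned above.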
    

    When enough time has passed (typically much less for bare server repositories, e.g., those on GitHub, but 30+ days in your own repository), the abandoned commits will be swept away with the garbage when Git's garbage collector runs and cleans up. At that point, no one will remember the original hash IDs and the original commits will be nowhere to be found.

    Well, nowhere, except any other clones that anyone ever made of them. If there are such clones, you may need to root them out and destroy them, or at least, make sure you never fetch-and-merge from them again, or you will get all the old commits back again.
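
    If you would rather not wait for the garbage collector in your own repository, the expiry can be forced. The sketch below does so in a throwaway sandbox (the branch old and the files file.txt and big.bin are hypothetical). Note that expiring the reflogs removes your safety net for every abandoned commit, not only the ones you meant to discard, so treat this as an illustration, not a recommendation.

```shell
#!/usr/bin/env bash
# Sandbox only: everything happens in a fresh temporary directory.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
cd "$sandbox"
git init -q
git symbolic-ref HEAD refs/heads/old

echo payload > big.bin
git add big.bin
git commit -q -m "H: original commit"
old_tip=$(git rev-parse HEAD)

git checkout -q --orphan master        # unrelated, rewritten history
git rm -rf -q .
echo hello > file.txt
git add file.txt
git commit -q -m "H': rewritten commit"
new_tip=$(git rev-parse HEAD)

# Abandon the old history: no branch names it any more.
git branch -q -D old

# It still exists, because HEAD's reflog remembers it.
if git cat-file -e "$old_tip" 2>/dev/null; then before=present; else before=gone; fi

# Expire every reflog entry, then prune unreachable objects immediately.
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$old_tip" 2>/dev/null; then after=present; else after=gone; fi
echo "old commit before gc: $before, after gc: $after"
```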