Tags: git, github, bfg-repo-cleaner

BFG duplicated my previous commits along with "clean" commits


I was trying to use BFG to remove some files/folders that were accidentally uploaded to my remote Git repository. After following a guide, I seem to have duplicated commits: one set of branches that is purged of the data, and one set that still has the data. Here is the network graph demonstrating this: https://github.com/barricklab/pLannotate/network

I first:

git clone --mirror https://github.com/barricklab/pLannotate.git

I then used several commands similar to:

bfg --delete-files *.gbk

and eventually used:

git reflog expire --expire=now --all && git gc --prune=now --aggressive

git push

I realized I had a local commit that wasn't pushed before I cloned, so maybe this had something to do with it? I'm not sure. At this point, I'm terrified of doing more damage to the repository, and I'm not sure how to remove the alternate set of branches that still contain the files I was trying to remove.

The very first commits to the repository after the initial branching show the "good" (files removed) and "bad" (files still present) branches:

https://github.com/barricklab/pLannotate/commit/e146338a62cda43f4d09df90ce90472807f0b60b https://github.com/barricklab/pLannotate/commit/01b5ee7bbb697d3aba30d4d2944ae716dfc53ab9

Can anyone help me get out of this pickle and remove this duplicate set of branches?


Solution

  • ... after following [the] guide, I seem to have duplicated commits

    This is how The BFG works.

    This is how anything that does this sort of job with Git works, because no commit can ever be changed. It is literally impossible to "fix" a bad commit. The only thing anyone or anything can do is make a new "duplicate" (but slightly different) commit, which gets a new and different hash ID.

    Because commits form chains, and Git works backwards from the last commit to the first, any change you want made requires updating every subsequent commit even if the file-snapshots of the subsequent commits are 100% identical to the originals:

    A  <-B  <-C  <-D  <-E  <-F  <-G  <-H   <--main
                        ^
                        |
       let's say this commit is bad: has a big file
    

    To "fix" this big-file problem, even though the big file is removed in commit F, we must copy commit E to a new-and-improved commit E':

    A--B--C--D--E--F--G--H
              \
               E'
    

    Once we've done that, we must now copy commit F to a new-and-improved F', with the one change being that F' points back to E', rather than to the original (bad) E:

    A--B--C--D--E--F--G--H
              \
               E'-F'
    

    Once we've done that, we're forced to copy G for the same reason, and again with H. The final result is:

               E--F--G--H   [abandoned]
              /
    A--B--C--D--E'-F'-G'-H'   <--main
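
    As a rough illustration of the copying described above, the following sketch builds a throwaway repository, "fixes" a bad commit E, and replays F on top of it. The file names, commit messages, and the rewritten branch are all made up for the demo, and git commit --amend plus git cherry-pick merely stand in for the copying step; this is not a description of how The BFG is implemented.

```shell
#!/bin/sh
# Sketch only: every path, name, and message here is hypothetical.
set -e
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git symbolic-ref HEAD refs/heads/main   # name the first branch "main"

# E: the bad commit, which adds a (pretend) big file.
echo "small" > notes.txt
echo "pretend this is huge" > big.gbk
git add notes.txt big.gbk
git commit -q -m "E: adds big.gbk by mistake"
old_e=$(git rev-parse HEAD)

# F: removes the big file again and makes an unrelated change.
git rm -q big.gbk
echo "more" >> notes.txt
git commit -q -am "F: remove big.gbk, edit notes"
old_f=$(git rev-parse HEAD)

# E': a copy of E without the big file. It gets a new hash.
git checkout -q -b rewritten "$old_e"
git rm -q big.gbk
git commit -q --amend --no-edit
new_e=$(git rev-parse HEAD)

# F': a copy of F whose parent is E' instead of E. Same snapshot, new hash.
git cherry-pick "$old_f" > /dev/null
new_f=$(git rev-parse HEAD)

echo "E  = $old_e"
echo "E' = $new_e"
echo "F  = $old_f"
echo "F' = $new_f"
echo "tree of F : $(git rev-parse "$old_f^{tree}")"
echo "tree of F': $(git rev-parse "$new_f^{tree}")"
```

    The two tree lines should match even though the commit hashes of F and F' differ, because the only thing that changed in F' is the parent it points back to.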
    

    The BFG and other Git fixers will, if/when appropriate, discard the old commits entirely (Git likes to hang on to them as long as possible). But if you introduce this new repository to the old repository again, the old repository will say: Oh, I see you're missing these commits, E-F-G-H and give them right back to you and let you merge them:

               E---F--G---H
              /            \
    A--B--C--D--E'-F'-G'-H'-M   <--main
    

    and now you have both the old commits and the new commits. The solution is to make sure the new repository, with its altered commits, never comes into contact with any of the old repositories (no fetch, pull, or push between them), so that a Git using an old repository can't hand back the old commits you purged when you made the new ones.

    In other words, don't rejoin a filtered repository with its pre-filtered version or you'll bring back everything you just worked so hard to get rid of.
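
    One way to check whether the old commits have crept back in is to ask Git whether any commit reachable from any ref still touches the unwanted files. The sketch below does this in a throwaway repository; the branch name old-history, the file sample.gbk, and the helper count_tainted are hypothetical names chosen for the demo.

```shell
#!/bin/sh
# Sketch only: repository layout, branch names, and file names are made up.
set -e
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git symbolic-ref HEAD refs/heads/main

echo "readme" > README.txt
git add README.txt
git commit -q -m "D: shared base"

# The "old" strand, which still contains the unwanted file.
git checkout -q -b old-history
echo "pretend this is huge" > sample.gbk
git add sample.gbk
git commit -q -m "E: adds sample.gbk"

# The "new" strand, which does not.
git checkout -q main
echo "clean" > clean.txt
git add clean.txt
git commit -q -m "E': clean replacement"

# Count commits, reachable from any ref, that touch a *.gbk file.
count_tainted() {
    git log --all --format=%H -- '*.gbk' | wc -l | tr -d ' '
}

before=$(count_tainted)
git branch -q -D old-history      # drop the ref that keeps the old strand alive
after=$(count_tainted)

echo "commits touching *.gbk before: $before, after: $after"
```

    In a real repository, git branch -a --contains followed by the hash of a known old commit lists which branches still lead back to it.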

    Fixing the mess if you've rejoined the old commits

    To remove such a merge as M above, assuming you've just added it, you'd generally want to run git reset --hard HEAD^ or git reset --hard HEAD~. (Both of these do the same thing, although some command line interpreters make one or the other easier to type in: CMD.EXE in particular makes you type ^^ instead of ^ so ~ is easier. Note that you can, but don't have to, add 1 after ^ or ~ as well.)

    Depending on what you use to view commits, you may well still see both the old and new commits. What you should no longer see, after the reset, is the added merge commit: the old and new commits will be separate "strands".
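
    The reset can be tried out safely in a throwaway repository first. The following sketch creates two strands, merges them to produce a commit like M, and then removes M again; the branch and file names are invented for the example.

```shell
#!/bin/sh
# Sketch only: branch names, file names, and messages are hypothetical.
set -e
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git symbolic-ref HEAD refs/heads/main

echo "base" > file.txt
git add file.txt
git commit -q -m "D: shared base"

# Old strand (stands in for E-F-G-H).
git checkout -q -b old-history
echo "old" > old.txt
git add old.txt
git commit -q -m "H: old strand"

# New strand (stands in for E'-F'-G'-H').
git checkout -q main
echo "new" > new.txt
git add new.txt
git commit -q -m "H': rewritten strand"
before_merge=$(git rev-parse HEAD)

# The accidental merge M that rejoins the two strands.
git merge -q --no-edit -m "M: accidental merge" old-history
merge_commit=$(git rev-parse HEAD)

# Undo it: move main back to M's first parent.
git reset -q --hard HEAD~
after_reset=$(git rev-parse HEAD)

echo "before merge: $before_merge"
echo "merge commit: $merge_commit"
echo "after reset : $after_reset"
```

    After the reset, main points at the same commit it did before the merge, while old-history still exists as a separate strand until it is deleted.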

    To update a GitHub or Bitbucket or other hosting site, you must force it to replace the old commits with the new-and-improved commits. There are two options here:

    • Remove or rename the old repository, so that it no longer exists on the hosting site, or exists under some different name. Create a new, empty repository on the hosting site, and use git push from the local repository. You may want to use git push --mirror, which automatically pushes all branches and tags, but note that this also pushes all remote-tracking names, which you might not want to do. You may instead want to run git push --all followed by git push --tags (Git does not accept --all and --tags together in a single git push).

    • Or, use git push --force, again perhaps with --mirror, or once with --all and once with --tags.

    Note that with git push --force, you're losing your backup on the hosting site, so be very sure that you have the right set of commits here. The BFG does an in-place rewrite; some other repository-adjusters, such as git filter-repo, by default insist on being run in a freshly made clone, so that your "regular work" clone is left undamaged and serves as a backup.
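
    The effect of the forced push can be rehearsed against a local bare repository before touching the real hosting site. In this sketch the directory hosting.git plays the role of GitHub, and the file and tag names are made up; a real hosting site may add its own restrictions (such as protected branches) that this demo does not model.

```shell
#!/bin/sh
# Sketch only: a local bare repository stands in for the hosting site.
set -e
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

work=$(mktemp -d)
git init -q --bare "$work/hosting.git"
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/main
git remote add origin "$work/hosting.git"

# Original history, including the unwanted file, gets pushed.
echo "pretend this is huge" > big.gbk
echo "text" > notes.txt
git add big.gbk notes.txt
git commit -q -m "E: has big.gbk"
git tag v1
git push -q origin --all
git push -q origin --tags

# Local rewrite: replace E with E' (a stand-in for what a history cleaner does).
git rm -q big.gbk
git commit -q --amend --no-edit
git tag -f v1 > /dev/null

# A plain push is refused because the new history does not extend the old one.
if git push -q origin --all 2> /dev/null; then
    plain_push=accepted
else
    plain_push=rejected
fi

# A forced push replaces the old branch tips and tags with the new ones.
git push -q --force origin --all
git push -q --force origin --tags

local_main=$(git rev-parse refs/heads/main)
remote_main=$(git --git-dir="$work/hosting.git" rev-parse refs/heads/main)
local_tag=$(git rev-parse refs/tags/v1)
remote_tag=$(git --git-dir="$work/hosting.git" rev-parse refs/tags/v1)

echo "plain push: $plain_push"
echo "local main : $local_main"
echo "remote main: $remote_main"
```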

    In all cases, consider making your own personal backups before doing anything. It's almost always easier to restore from a personal backup that you just made just now, than it is to restore from some standard backup that you hope was made last week but it turns out that the backup system died last year and no one got around to fixing it because everything has been just fine, why do you ask? 😱
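
    A personal backup can be as simple as a mirror clone or a bundle file kept somewhere outside the repository you are about to rewrite. The sketch below shows both on a throwaway repository; the paths backup.git and backup.bundle are placeholders, and which form suits you better is a matter of taste.

```shell
#!/bin/sh
# Sketch only: paths and names are placeholders for the demo.
set -e
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/main
echo "hello" > file.txt
git add file.txt
git commit -q -m "initial commit"

# Backup 1: a mirror clone, which copies every ref as-is.
git clone -q --mirror "$work/repo" "$work/backup.git"

# Backup 2: a single bundle file holding every ref and its history.
git bundle create "$work/backup.bundle" --all > /dev/null 2>&1

# Check that the bundle is readable and the mirror has the same tip.
if git bundle verify "$work/backup.bundle" > /dev/null 2>&1; then
    bundle_ok=yes
else
    bundle_ok=no
fi
original=$(git rev-parse refs/heads/main)
mirrored=$(git --git-dir="$work/backup.git" rev-parse refs/heads/main)

echo "bundle ok: $bundle_ok"
echo "original main: $original"
echo "mirrored main: $mirrored"
```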