
Possible to undo permanent deletion of file?


A colleague of mine attempted to permanently remove a file (Diff.java) from the history of our GitHub repo.

He had good reasons for wanting to do this. However, something seems to have gone wrong: we have lost quite a few files, which have been replaced by equivalent files with the suffix .REMOVED.git-id. For example ivy-2.2.0.jar -> ivy-2.2.0.jar.REMOVED.git-id.

I have managed to repair the main development branch as I happened to have a copy locally. However there are many historical branches for development lines and tags for releases that now seem to be broken in the way described above.

I understand that he ran a process similar to:

$ git clone --mirror git://example.com/some-big-repo.git
$ java -jar bfg-1.12.3.jar --strip-biggest-blobs 500 some-big-repo
$ cd some-big-repo
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
$ git push

$ cd ..
$ java -jar bfg-1.12.3.jar --delete-files Diff.java some-big-repo
$ cd some-big-repo
$ git push

I am guessing that the process was destructive, and there is no way to recover unless we happen to have a clean mirror somewhere before this happened. Can anyone confirm or offer some advice?


Solution

  • This was the step that deleted all those old jars:

    $ java -jar bfg-1.12.3.jar --strip-biggest-blobs 500 some-big-repo

    ...as the author of the BFG, I'm distressed to realise --strip-biggest-blobs 500 wasn't as clear as I thought. The command removes the largest 500 files (i.e. big files, or binary large objects: 'blobs') from the repository's history. I would be very interested to know what the user thought that step would do!

    This is the command that correctly got rid of Diff.java:

    $ java -jar bfg-1.12.3.jar --delete-files Diff.java some-big-repo

    The instructions for the BFG say "you should make a backup" of your repository before running the BFG, but it sounds like that didn't happen here.

    You may still have a chance to recover your old branches and tags, given two things:

    1. Repositories where the raw object data is still available. That would be your local copy, and possibly also GitHub, as they don't run git gc on their repos immediately - the objects may well still be around, and may even be referenced by old pull requests, if you use them. I would take an immediate mirror clone of your GitHub repo.
    2. You also need the old 'ref' values (the original branch and tag commit ids). You may be able to find them in the reflog of your local copy, or in the logs of your CI server. The BFG prints out the old and new values of changed refs on the command line, but I guess you don't still have that output. The BFG does not currently save that output, but it does save an object-id-map.old-new.txt file under the some-big-repo.bfg-report directory every time it runs, containing the old ids, and the new ids, for every commit it altered. There will be more than one of these files, because the BFG was run more than once. Using these files, and examining your current refs, you should be able to back-track through the two BFG runs to find out what the original commit ids of your refs were.
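
    The back-tracking described in point 2 could be scripted along these lines. This is only a sketch: it assumes each line of object-id-map.old-new.txt holds an old id followed by a new id, separated by whitespace, so check a few lines of your own files first. The function name original_id is made up for illustration, and the map files must be passed newest run first.

```shell
#!/bin/sh
# Trace a current (rewritten) id back to the id it had before the BFG runs.
# Assumption: every map line has the form "<old-id> <new-id>".
# Usage: original_id <current-id> <newest-map-file> [<older-map-file> ...]
original_id() {
    id=$1
    shift
    for map in "$@"; do
        # If this run rewrote the object, step back to the id it had before the run.
        prev=$(awk -v new="$id" '$2 == new { print $1; exit }' "$map")
        if [ -n "$prev" ]; then
            id=$prev
        fi
    done
    # An id that no run touched is printed unchanged.
    printf '%s\n' "$id"
}
```

    You could feed it the current ids reported by git for-each-ref, one ref at a time, and record the results next to the ref names.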

    Your recovery process, given those things, is something like this:

    • Take a --mirror clone of the repository most likely to still contain your old objects.
    • Test to see if it really does have those objects. So, supposing you can establish that the old id for master was 686b0cd80ac328e060b80dda3c9dadb1e400134a, do git cat-file -p 686b0cd80ac328e060b80dda3c9dadb1e400134a. You will see a summary of the commit if the object is still around. If it's not, add remotes for your other candidate repos, and try pulling in the data from there.
    • Set master branch to the value of the original commit with git update-ref: git update-ref refs/heads/master 686b0cd80ac328e060b80dda3c9dadb1e400134a

    Repeat for all the other branches and tags that you care about - hopefully you can script this, good luck!
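
    As a rough sketch of how that scripting might look, the functions below check that an original commit still exists before pointing a ref at it. The names restore_ref and restore_all, and the input file of "<refname> <original-id>" pairs, are hypothetical; you would build that file yourself from the reflog, CI logs, or the BFG's id maps. Run it inside the mirror clone, and keep a backup copy of that clone first.

```shell
#!/bin/sh
# Restore one ref to its original commit, but only if the object still exists.
# Usage: restore_ref <refname> <original-id>
restore_ref() {
    ref=$1
    id=$2
    # cat-file -e exits non-zero when the object is missing from this repository.
    if git cat-file -e "$id^{commit}" 2>/dev/null; then
        git update-ref "$ref" "$id"
        echo "restored $ref -> $id"
    else
        echo "missing  $ref ($id)" >&2
        return 1
    fi
}

# Restore every ref listed in a file of "<refname> <original-id>" lines
# (a hypothetical file you prepare by hand). Missing objects are reported
# on stderr and skipped, so one bad entry does not stop the rest.
restore_all() {
    while read -r ref id; do
        restore_ref "$ref" "$id" || true
    done < "$1"
}
```

    For example, restore_ref refs/heads/master 686b0cd80ac328e060b80dda3c9dadb1e400134a does the same as the update-ref command above, with the existence check added. Once the refs look right locally, they still need to be pushed back to GitHub.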