git, github, bfg-repo-cleaner

Troubleshooting getting rid of large files in a git repository on GitHub


I have a project called geoplot that does geospatial plotting in Python. The code for it is distributed via git on GitHub. You can check it out here.

As part of the development process for this package, I committed a folder called data/ to the geoplot repo. It contained a large number of data files in various formats, which were used to populate the examples in the complementary example gallery.

However, these files bloat the overall repository size way up to ~150 MiB (issue). This is clearly way too much, and it's time for me to get rid of them.

The problem is that I need not just to remove these files from the current HEAD but also to scrub them out of the entire git history. I tried a manual approach using git rebase that didn't work. Then I tried the BFG Repo-Cleaner tool, as recommended in the canonical SO question on the matter.

BFG rid me of the files alright: they no longer exist anywhere in the history. However, the size of the repo (as seen when running git clone https://github.com/ResidentMario/geoplot.git) did not go down at all!

Here is what I tried (minus printouts):

java -jar ../bfg-1.12.15.jar --delete-folders "data" .
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push --set-upstream https://github.com/ResidentMario/geoplot.git master --force

The full printout is in an issue on GitHub.

What, if anything, did I do wrong? How do I diagnose the source of and expunge this wasted space?


Solution

  • I did mention reflog and gc back in 2010, but also removing old objects.
    (Note: gc should be followed by a repack)

    First, check whether a fresh clone of your repo still has the same size.

    As the OP Aleksey Bilogur mentions in the comments:

    • you need to make sure your tags are not referencing the old data, and then you need to force-push all the tags and branches as well (not just master)

      git push --tags origin --force
      
    • generated data must be removed from the repo history.
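
The checks above can be sketched as two small shell helpers, meant to be run inside a fresh clone. This is only a minimal sketch, not part of the original answer: the function names refs_touching_path and packed_size_kib are hypothetical, and data is the folder name taken from the question. The first helper lists every ref (branch, tag, or remote-tracking branch) whose history still touches a path; the second prints the packed size that git count-objects reports.

```shell
#!/bin/sh

# Print every ref whose history still contains commits touching the given path.
# Empty output means no branch or tag keeps that path alive.
# (Hypothetical helper name; assumes it is run inside a git repository.)
refs_touching_path() {
    path="$1"
    git for-each-ref --format='%(refname)' | while read -r ref; do
        # rev-list prints the newest commit reachable from $ref that
        # touched $path, or nothing if no such commit exists.
        if [ -n "$(git rev-list -n 1 "$ref" -- "$path" 2>/dev/null)" ]; then
            echo "$ref"
        fi
    done
}

# Print the size of the packed objects in KiB, as reported by
# `git count-objects -v` (the "size-pack" field).
# (Hypothetical helper name.)
packed_size_kib() {
    git count-objects -v | awk '/^size-pack:/ { print $2 }'
}
```

For example, running refs_touching_path data in a fresh clone should print nothing once the cleanup has really reached the remote. If it prints something like refs/tags/v0.1 (an illustrative name), that tag still points at the old history and has to be rewritten and force-pushed. Branches other than master can be force-pushed with git push origin --force --all, alongside the git push --tags origin --force shown above.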