Tags: git, git-rewrite-history, bfg-repo-cleaner

Git Repository Only Gets Bigger After Using BFG


We are currently migrating our SVN repo to Git (hosted on Bitbucket). I used SubGit to import all our branches and history into a bare repo that I have locally on my (Windows) PC.

The repo is quite big (7.42 GB after the import). This is because it also contains SVN information, such as revision numbers, to support a two-way sync between Git and SVN (I'm only interested in a one-way SVN-to-Git migration).

I created a local clone of the imported bare repo and pushed all the branches to Bitbucket. After a couple of hours (!) the repo was fully uploaded. Bitbucket then gave me warnings about the repo size. I checked the size and it was 1.1 GB. That's not as big as the imported bare repo, but still too big for a fast repository.

After playing around with BFG I managed to remove some large DLL/SQL export files using these commands on the bare repo (I only use the clone for pushing, without all the SVN-related refs):

java -jar bfg.jar --delete-files '{''specialized 2015''','''specialized,''insert-pcreeks''}.sql' --no-blob-protection

java -jar bfg.jar --delete-files 'Incara.*.dll' --no-blob-protection Incara.git

git reflog expire --expire=now --all && git gc --prune=now --aggressive

This took a while, and afterwards the git_find_big.sh script no longer showed these large SQL files. But after pushing things back to Bitbucket (as a new repo, not as a force push) the repo only got bigger (1.8 GB).

Can you provide a possible explanation for this behavior?

I don't know if it matters, but we used a non-standard branch/tag model in SVN. This resulted in branches like /refs/heads/archive/some/path/to/branch. These branches seemed to work just fine, and removing them did not affect the size either.

Besides these problems, I noticed some XML files showing up in the git_find_big.sh output:

size,pack,SHA,location
12180,1011,56731c772febd7db11de5a66674fe6a1a9ec00a7 repository/frontend.xml
12074,1002,0cefaee608c06621adfa4a9120ed7ef651076c33 repository/frontend.xml
12073,1002,a1c36cf49ec736a7fc069dcc834b784ada4b6a06 repository/frontend.xml
12073,1002,1ba5bd92817347739d3fba375fc42641016a5c1d repository/frontend.xml
12073,1002,e9182762bfc5849bc6645fdd6358265c3930779f repository/frontend.xml
12073,1002,dff5733d67cb0306534ac41a4c55b3bbaa436a2e repository/frontend.xml
12072,1002,8ee628f645ce53d970c3cf9fdae8d2697224e64c repository/frontend.xml
12072,1002,1266dee72b33f7a05ca67488c485ea8afc323615 repository/frontend.xml

These files contain the frontend logic of the web platform we are using and are indeed quite big. But they should be treated as text, right? So I don't understand why they show up as separate objects in the output above. Am I right that this should not be happening?

The SVN import also resulted in some empty commits (for example when SVN creates or moves a branch it needs a new commit). I guess these can only be removed using filter-branch?

Sorry, I have a lot of questions! Could someone help me with this?

Thanks,

Piet


Solution

  • I've asked for some more diagnostic information in comments on your question, which is needed to give a reasonable answer to the main part. As for your secondary questions (which Stack Overflow encourages you to ask separately, incidentally!), here are some pointers:

    Besides these problems, I noticed some XML files showing up in the git_find_big.sh output: [snip]

    These files contain the frontend logic of the web platform we are using and are indeed quite big. But they should be treated as text, right? So I don't understand why they show up as separate objects in the output above. Am I right that this should not be happening?

    Git assigns each file an id computed from its contents (a SHA-1 hash) and, as far as that goes, doesn't care whether your files are text or not. If two files differ even slightly, their ids differ and they are stored as separate objects (Git may apply delta compression under the hood when packing, but that doesn't stop the files being logically separate). So it's not surprising that different versions of the same file show up more than once in the git_find_big.sh output.
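To make the point about content-based ids concrete, here is a minimal sketch that computes a blob id by hand: a SHA-1 over the header "blob <size>" plus a NUL byte, followed by the file content. The helper name blob_id and the two tiny XML files are made up for illustration, and the sketch assumes a POSIX shell with sha1sum available. The result should match what `git hash-object <file>` prints for the same file.

```shell
#!/bin/sh
# Hypothetical helper: compute the id Git would assign to a file's contents.
# A blob id is the SHA-1 of "blob <size>\0" followed by the raw bytes.
blob_id() {
    size=$(wc -c < "$1" | tr -d ' ')   # size in bytes, whitespace stripped
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

tmp=$(mktemp -d)

# Two illustrative versions of a file that differ by a single character.
printf '<frontend version="1"/>\n' > "$tmp/a.xml"
printf '<frontend version="2"/>\n' > "$tmp/b.xml"

# The ids differ, so Git records these as two separate objects.
echo "a.xml: $(blob_id "$tmp/a.xml")"
echo "b.xml: $(blob_id "$tmp/b.xml")"
```

Because the id depends only on the bytes, every distinct revision of repository/frontend.xml gets its own entry, however small the change between revisions.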

    The SVN import also resulted in some empty commits (for example when SVN creates or moves a branch it needs a new commit). I guess these can only be removed using filter-branch?

    Yep, BFG doesn't do this out-of-the-box. However, it's one task that filter-branch does do reasonably quickly (even if it is fiddly to use).
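As a rough sketch of the filter-branch route, the script below builds a throwaway repository containing one empty commit and then drops it with the --prune-empty option. The repository contents, commit messages and identity are invented for the demonstration; on a real import you would run only the filter-branch line, on a backup copy first.

```shell
#!/bin/sh
set -e
# Skip the warning (and its delay) that newer Git versions print for filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository, for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

echo one > file.txt
git add file.txt
git commit -q -m "real change"
# Stand-in for the empty commits an SVN import can leave behind.
git commit -q --allow-empty -m "empty commit, like an SVN branch creation"
echo two >> file.txt
git commit -q -am "another real change"

before=$(git rev-list --count HEAD)

# Rewrite all refs, dropping commits that leave the tree unchanged.
git filter-branch --prune-empty -- --all >/dev/null 2>&1

after=$(git rev-list --count HEAD)
echo "commits before: $before, after: $after"
```

If the repository has tags that should follow the rewritten commits, adding `--tag-name-filter cat` to the filter-branch call is the usual companion option.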