Removing large files from TFS cloned git repository


I have been tasked with migrating our Team Foundation Server (TFS) repositories into the agency GitHub Enterprise (GHE) and keeping the entire changelog intact. I am using the git-tfs tool with the following syntax to create a local copy of the primary source branch:

git tfs clone --all --with-labels <server>:8080/tfs/ $/<branch>

The process takes about 30 hours and when that completes I have a directory structure of ~45 GB that contains a ~6 GB .git repository sub-structure. When I attempt to push this to our agency GHE I get errors regarding large files, because the agency doesn't have Large File Storage enabled and has no plans to enable it.

I have brought this to the attention of my superiors and been instructed to "remove the large files and make the upload." I ran an audit of all files >20 MB as instructed and have a spreadsheet I can copy/paste into Notepad++ for scripting the removal process.

I tried running git rm on the large files followed by git commit -m, but I am learning that this doesn't work because the history still contains the large files. The git push to GHE simply threw back the same errors I was seeing before.

My research has led me to several solutions, such as BFG Repo-Cleaner and git filter-repo. Both tools require a --mirror copy of the repository, which git-tfs doesn't support. Git-tfs only supports a --bare option, and the documentation for git clone doesn't help me understand the difference. I understand that both are just the repository directory and not the raw file structure, but not much more. I also do not understand how to push a mirrored local copy that doesn't have a file structure into GHE.

I've raised these issues to my leadership and been instructed to:

  1. git-tfs clone TFS to local
  2. git clone --mirror the local copy to a secondary local copy
  3. Attempt to run BFG or git-filter-repo against the secondary copy
  4. ???? [I don't know what comes after this]

I'm unclear on several things.

  1. Doesn't the mirrored secondary still point to the TFS as origin?
  2. Do I have to push the secondary local to the primary local and then push the primary local to GHE, as the secondary has no file structure?
  3. How do I perform an audit of the changelog to see what was modified to ensure that history is preserved? I don't want to be punished 6 months or a year from now because the developers are looking for a specific change and can't find it.

Solution

  • You shouldn't need a mirror to use git-filter-repo, as it can work on an existing repo, and git-tfs should have left you with a working Git repo. If you can, I would just back up the entire repo (the 45 GB directory, including its 6 GB .git folder), and then you can wash your hands of the TFS portion. You now have a Git repo that you can play around with, and if things go badly you can simply delete it and restore it from the backup.
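As a rough sketch of that backup step, assuming a plain directory copy is acceptable: the snippet below builds a tiny throwaway repo so it can run anywhere, copies it, and checks the copy. The directory names (tfs-clone, tfs-clone.backup) are placeholders I made up, not anything from your setup.

```shell
#!/bin/sh
set -eu

# Throwaway demo repo standing in for the git-tfs clone (hypothetical paths).
demo=$(mktemp -d)
SRC="$demo/tfs-clone"
BACKUP="$demo/tfs-clone.backup"

git init -q "$SRC"
echo hello > "$SRC/readme.txt"
git -C "$SRC" add readme.txt
git -C "$SRC" -c user.name=demo -c user.email=demo@example.invalid \
    commit -q -m "first commit"

# Copy the whole directory, .git included, preserving modes and timestamps.
cp -Rp "$SRC" "$BACKUP"

# Sanity-check the copy: the object store is intact and HEAD matches.
git -C "$BACKUP" fsck --full >/dev/null 2>&1
src_head=$(git -C "$SRC" rev-parse HEAD)
backup_head=$(git -C "$BACKUP" rev-parse HEAD)
echo "source HEAD: $src_head"
echo "backup HEAD: $backup_head"
```

Copying the directory rather than cloning it keeps everything under .git, including whatever configuration git-tfs wrote there, so the backup can be dropped back in place unchanged.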

    Once it's backed up, I would try using git-filter-repo to remove the large files. git-filter-repo normally refuses to rewrite a repo that isn't a fresh clone, but you can override that check with the --force option. There is also an option for removing files larger than a certain size, and in your case you might use: --strip-blobs-bigger-than 20M. Note that git-filter-repo is much faster than the other options (and much faster than the git-tfs clone), so it's pretty common to do multiple passes. For example, you could first strip out all the large files, and then you might do another pass to remove some passwords or other undesirable changes (or entire commits).
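A minimal sketch of that first pass, assuming git-filter-repo is installed and on your PATH. The function name strip_large_blobs and its arguments are my own illustration; only the --strip-blobs-bigger-than and --force options come from git-filter-repo itself.

```shell
#!/bin/sh
# strip_large_blobs REPO_DIR [THRESHOLD]
# Rewrites all of history in REPO_DIR, dropping every blob larger than
# THRESHOLD (default 20M). Run it only on a repo you have backed up.
strip_large_blobs() {
    repo_dir=$1
    threshold=${2:-20M}
    (
        cd "$repo_dir" || exit 1
        # --force is needed because a git-tfs clone is not a fresh `git clone`.
        git filter-repo --strip-blobs-bigger-than "$threshold" --force
    )
}

# Example call (hypothetical path):
#   strip_large_blobs /path/to/tfs-clone 20M
```

If you would rather remove exactly the paths from your audit spreadsheet instead of everything over a size, git-filter-repo also accepts a list of paths via --paths-from-file, combined with --invert-paths to drop those paths rather than keep them. Running it once with --analyze beforehand writes size reports under .git/filter-repo/analysis, which can be a useful cross-check against your own audit.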

    As for your specific questions, the fact that you don't actually need a mirror makes your first two questions irrelevant. Once you have your repo the way you want it, you just push it to whatever Git host you'd like, such as GitHub Enterprise. For your third question:

    "How do I perform an audit of the changelog to see what was modified to ensure that history is preserved?"

    The way I've done this in the past is with the following checks:

    1. First, compare the final state of the TFVC repo and the new Git repo. The only differences should be that the large files are missing in the new Git repo, plus any other tweaks you may have made.
    2. Spot check some random TFVC changesets against their corresponding Git commits. Last time I did this, I picked about 20 changesets at random to make sure they all matched up, and then I asked some developers to do the same, which helped them gain confidence in the conversion. You could also pick some Git commits at random and go find their corresponding TFVC changesets.
    3. Spot check TFVC changesets that you know were modified in Git (e.g. large files were deleted). The only difference between the two should be that the large files are gone. In the case where a TFVC changeset contained only a large file, there should be no corresponding Git commit at all.
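To go with those checks, here is a sketch of one more automated check and the final push, assuming nothing beyond stock git and awk. It lists every blob still reachable in history above a byte limit, and pushes all branches and tags only when that list is empty. The function names, the demo repo, and the local bare repository standing in for GHE are all made up for illustration.

```shell
#!/bin/sh
set -eu

# list_big_blobs REPO LIMIT_BYTES
# Print "size path" for every blob reachable from any ref that exceeds the limit.
list_big_blobs() {
    git -C "$1" rev-list --objects --all |
    git -C "$1" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v limit="$2" '$1 == "blob" && $2 + 0 > limit + 0 { print substr($0, 6) }'
}

# push_if_clean REPO REMOTE_URL LIMIT_BYTES
# Push all branches and tags, but only if no oversized blob is left in history.
push_if_clean() {
    big=$(list_big_blobs "$1" "$3")
    if [ -n "$big" ]; then
        echo "blobs over $3 bytes are still in history:" >&2
        echo "$big" >&2
        return 1
    fi
    git -C "$1" push -q "$2" --all
    git -C "$1" push -q "$2" --tags
}

# --- throwaway demo; every path below is a stand-in ---
demo=$(mktemp -d)
repo="$demo/work"
ghe="$demo/ghe.git"            # local bare repo playing the role of GHE
git init -q "$repo"
git init -q --bare "$ghe"

dd if=/dev/zero of="$repo/big.bin" bs=1000 count=2 2>/dev/null
echo hello > "$repo/small.txt"
git -C "$repo" add .
git -C "$repo" -c user.name=demo -c user.email=demo@example.invalid \
    commit -q -m "add files"
git -C "$repo" rm -q big.bin
git -C "$repo" -c user.name=demo -c user.email=demo@example.invalid \
    commit -q -m "remove big file"

# git rm only removed big.bin from the latest commit; history still has it.
found=$(list_big_blobs "$repo" 1000)

# With a 1000-byte limit the push is refused; with 5000 it goes through.
if push_if_clean "$repo" "$ghe" 1000 2>/dev/null; then pushed_strict=yes; else pushed_strict=no; fi
if push_if_clean "$repo" "$ghe" 5000; then pushed_loose=yes; else pushed_loose=no; fi
echo "still in history: $found"
echo "push at 1000-byte limit: $pushed_strict, at 5000-byte limit: $pushed_loose"
```

On the real repo you would run list_big_blobs with 20971520 (20 MB in bytes) after the filter-repo pass and expect no output. For the spot checks, git-tfs normally appends a git-tfs-id: line ending in ;C followed by the changeset number to each commit message, so something like git log --all --grep='git-tfs-id:.*;C1234$' should find the commit for changeset 1234 (a made-up number). Check a few of your own commit messages first to confirm that line is present in your clone.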