
BFG Repo Cleaner – Alternative to Fresh Clone


I was going to ask this on the repository but SO seemed like a more fitting place to ask this.

I was able to use BFG Repo Cleaner (great tool, thank you!) to reduce our .git folder size by over 1GB, which is a smashing success as far as our repository is concerned. I have not pushed my bare clone to remote yet, as I am concerned with putting forward these changes before understanding the consequences of pushing and then not re-cloning.

I understand that best practice dictates that when history has changed in this way, the best solution is to perform a fresh clone. However, I work with a team of over 50 people in a repository of over 2GB and 23k commits, and cross-team coordination can be incredibly difficult under our structure. As a result, I have some questions:

  1. What would the consequences be if I were to push these changed refs and people were to pull to their existing copy rather than create a fresh clone?
  2. Would they need to do anything else to mitigate these consequences as part of, or in addition to their pull, if this is feasible?
  3. Does this recommendation change at all if you consider that the blobs that were deleted are from history that is at least a year old and at most three years old?
  4. Finally, given that a new clone would not include any work not synced upstream, do you have a recommendation on the best way to carry over untracked branches from one clone to another? If a Git command already exists to do this, I would love to hear your insight.

Thanks again for creating such a handy tool, and hopefully I can finish making it useful for my team's project. I will continue to experiment on my fork in the meantime.


Solution

  • Preface

    Before we get into this, let me clarify the recommended process for cleaning Git history in the context of an active team of developers (no matter what technology is used for the cleaning - whether BFG Repo-Cleaner or git filter-branch):

    1. Practice doing the clean a few times on a local disposable copy of your repository, so you're confident that you can do it and get the desired result, and you know how long it takes.
    2. COMMUNICATE WITH YOUR TEAM. This is essential, unavoidable (because Git is specifically built to complain and get in the way if history is rewritten) and just good practice for any team :-) You need to tell them:
      • Why the clean is happening (eg smaller repo!)
      • When the clean is planned - give them suitable advance warning.
      • To push all of their work up to the main repo before the clean commences - it doesn't need to be merged to the master branch, but all work needs to be pushed up on one branch or another.
      • That they'll need to delete their old copies of the repo when the clean is done, and re-clone the newly cleaned repository.
    3. When all work is pushed up to the main repo, do a mirror clone of the main repository. MAKE A BACKUP OF THIS CLONE, so that you can always go back if something goes wrong.
    4. Run the clean (with BFG Repo-Cleaner or a slower tool like git filter-branch), and use git gc to trim the dead objects.
    5. Once you're satisfied the clean has gone well, push the cleaned history back to the main repo (because it was a mirror clone, all the old branches/tags will be overwritten to the new cleaned history)
    6. Tell your team the time has come to delete their old copies of the repo, and re-clone the cleaned repository.
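
As a rough sketch, steps 3 to 5 might look something like the shell functions below. The function names (`prepare_clean`, `publish_clean`) and directory names (`repo.git`, `repo-backup.git`) are made up for illustration. The cleaning tool is passed in as an argument, so the same sketch works whichever tool you use, as long as it takes the repository path as its last argument.

```shell
#!/bin/sh
# Sketch of steps 3-5. Function names and directory names are illustrative
# assumptions, not part of Git or the BFG.

# Steps 3 and 4: mirror-clone, back up, clean, then trim dead objects.
# usage: prepare_clean <main-repo-url> <cleaning command and its options...>
prepare_clean() {
    url=$1; shift
    git clone -q --mirror "$url" repo.git || return 1
    cp -R repo.git repo-backup.git || return 1   # the backup you can go back to
    "$@" repo.git || return 1                    # cleaning tool, repo path appended last
    ( cd repo.git &&
      git reflog expire --expire=now --all &&
      git gc -q --prune=now --aggressive )
}

# Step 5: run this only after inspecting repo.git and being satisfied with it.
publish_clean() {
    # In a mirror clone a plain push updates every branch and tag on the remote.
    ( cd repo.git && git push -q )
}
```

With the BFG, a call could look like `prepare_clean git://example.com/some-big-repo.git java -jar bfg.jar --strip-blobs-bigger-than 100M`, followed later by `publish_clean`. The URL, the jar location and the 100M threshold are placeholders to adapt to your own repository.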

    So, to your questions:

    What if: a user with an old repo pulls from the cleaned repo?

    What would the consequences be if I were to push these changed refs and people were to pull to their existing copy rather than create a fresh clone?

    Bad. From experience I can say there will be a mess and people will get confused and upset.

    Specifically, what happens on that person's machine is that the git pull command will merge together the old dirty history and the new cleaned history, with two long divergent histories (diverging initially with the first 'dirty' commit in your history, which in your case was 3 years ago) being joined together with one brand new and very confusing merge commit. It's seldom clear to users that this has happened - most Git log visualisers will not render this in a way likely to make it apparent - if you're lucky a user might say something like "I've got two copies of every commit now, WTF?!" - but only if they're really observant.

    If that user later makes some new commits, and pushes back up to the main repository, they will have pushed the dirty history back up to the cleaned main repository, negating your work, making your history dirty again, and creating a very confusing Git history which all your other users will become exposed to next time they pull from the main Git repo.
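
If you want to see this for yourself before deciding, the scenario can be reproduced in a throwaway directory. The script below is only a sketch: every name in it (`seed`, `main.git`, `stale`, `big.bin`) is made up, and the rewrite is done by hand as a stand-in for the BFG so that nothing beyond Git is needed.

```shell
#!/bin/sh
# Throwaway demo: what `git pull` does in an old clone after the main repo
# has been rewritten. All names here are made up for this sketch.
set -e
demo=$(mktemp -d) && cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Build a tiny history: base -> dirty (adds big.bin) -> later work.
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/main
echo base > readme.txt && git add . && git commit -q -m "base"
echo junk > big.bin && echo notes > notes.txt && git add . && git commit -q -m "dirty"
echo code > code.txt && git add . && git commit -q -m "later work"
dirty=$(git rev-parse HEAD~1) later=$(git rev-parse HEAD)
cd ..

git clone -q --bare seed main.git    # the shared "main repo"
git clone -q main.git stale          # a teammate's clone, taken before the clean

# Stand-in for the BFG: rebuild the history without big.bin, then force-push.
cd seed
git reset -q --hard HEAD~2
git checkout -q "$dirty" -- notes.txt && git commit -q -C "$dirty"
git cherry-pick "$later" >/dev/null
git push -q --force ../main.git main
cd ..

# The teammate pulls into the old clone instead of re-cloning.
cd stale
git pull -q --no-rebase --no-edit origin main
git log --oneline --graph
echo "commits reachable: $(git rev-list --count HEAD)"
echo "copies of 'later work': $(git log --format=%s | grep -c '^later work$')"
git cat-file -e HEAD:big.bin && echo "big.bin is still here"
```

Three commits went in, but the old clone ends up with six reachable commits (both versions of the rewritten ones, plus the merge), and `big.bin` is still in the tree. Pushing from this clone would put all of that back on the main repo.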

    With planning, is there a way to let users keep their old repo but update it to have the cleaned history?

    Would they need to do anything else to mitigate these consequences as part of, or in addition to their pull, if this is feasible?

    Technically, yes. In practice, the procedure is complex, error-prone, and if just one user gets it wrong, you are screwed just like before.

    At this point, we have to work out why you're trying to dodge this procedure. Is it because:

    • You're trying to save users from having to know about & deal with the changed Git history? It sounds like this might be your goal based on your saying "cross-team coordination can be incredibly difficult under our structure" - but unfortunately this is not an attainable goal, because Git will not let you change history without users noticing. Users will have to do something, and they will need to coordinate with you.
    • You want to reduce the download time of doing a fresh clone of your really massive repository, hoping that Git will only download the changed blobs, and not all the stuff that didn't change? This is a slightly more reasonable goal for gigantic multi-gigabyte repos that take hours to download (tho' if you use the BFG to make the repo much smaller, there's less motivation) - unfortunately, due to details of the Git protocol you won't be able to realise those benefits. The Git protocol is designed to establish what commits are on the remote server that aren't in your local repo, and send a tailored packfile containing only what you need to bring your local repo up to date. This is great, but notice that the unit of comparison is commits. When you rewrite history, the file trees of the commits hardly change at all - but the commit ids all change, because a commit id is a hash of its parental history as well as its file-tree content. The Git protocol is only comparing commit ids, and from the first rewritten commit onwards they are all different - so all of those commits will get sent, along with their file-tree objects. The protocol doesn't dig deep enough to realise that it doesn't need to send most of those file-tree objects - and so you don't get the benefit of already having copies of them in your local repo.
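
The point about commit ids can be checked with Git's plumbing commands. This is a minimal sketch in a scratch repository (the file name and commit messages are made up): two commits are built with the same file tree, author, message and timestamp, differing only in their parent.

```shell
#!/bin/sh
# Throwaway demo: same file tree, different parent => different commit id.
set -e
demo=$(mktemp -d) && cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Pin the timestamps so the parent is the only thing that differs below.
export GIT_AUTHOR_DATE="1577836800 +0000" GIT_COMMITTER_DATE="1577836800 +0000"

git init -q repo && cd repo
echo hello > file.txt && git add file.txt
tree=$(git write-tree)               # one file tree, reused by every commit

old_parent=$(git commit-tree "$tree" -m "dirty parent")
new_parent=$(git commit-tree "$tree" -m "cleaned parent")
old_child=$(git commit-tree "$tree" -p "$old_parent" -m "later work")
new_child=$(git commit-tree "$tree" -p "$new_parent" -m "later work")

# The two children show the same "tree" line but different "parent" lines,
# and so they get different commit ids.
git cat-file -p "$old_child"
git cat-file -p "$new_child"
echo "old child: $old_child"
echo "new child: $new_child"
```

The two children point at an identical tree object, yet their ids differ, which is why a client holding only the old id looks to the server as if it has none of the rewritten commits.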

    Does it matter how long ago the bad stuff was in history?

    Does this recommendation change at all if you consider that the blobs that were deleted are from history that is at least a year old and at most three years old?

    If the bad stuff has been committed very recently, and no other users have pulled it yet (so, within the last few hours or minutes) you could possibly get away with quickly cleaning history on the main repo before anyone else pulls it. As soon as anyone else pulls dirty data, they need to be decontaminated, and the easiest way to do that is to delete and re-clone.

    If the bad stuff was committed years ago, then everyone has it, and they all need to be decontaminated.

    What about stray commits/branches that weren't pushed up to the main repository when it was cleaned?

    Finally, given that a new clone would not include any work not synced upstream, do you have a recommendation on the best way to carry over untracked branches from one clone to another?

    The recommended way to deal with this problem is to make sure it does not happen. Communicate with your team, tell them that the repository cleaning is going to take place, and all they have to do to make it work is make sure they've pushed all their work up on any branch to the main repository before you start the cleaning.

    If someone doesn't do this, they can try rebasing the branches they care about onto the cleaned history. For each feature branch, something like:

    $ git rebase --onto clean-origin/feature unclean-origin/feature feature

    ...(which loosely translates to "take all the commits that are on my feature branch, that I didn't push to the main repo before it was cleaned, and replay them on top of the main repo's cleaned version of that branch").

    If the user gets this wrong, or forgets to do it for just one branch, you will be back to the bad mixed dirty/clean history scenario.
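
For anyone who wants to rehearse that rebase first, here is a self-contained sketch in a throwaway directory. It assumes one way of ending up with the two remote names used above: renaming the existing remote to `unclean-origin`, which keeps the old tracking refs, and adding the cleaned repo as `clean-origin`. The repository names, file names and the hand-made rewrite (standing in for the BFG) are all made up for the demo.

```shell
#!/bin/sh
# Throwaway demo of `git rebase --onto` after a history clean.
# All names are made up; the hand-made rewrite stands in for the BFG.
set -e
demo=$(mktemp -d) && cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Main repo with one branch, "feature": base -> dirty (adds big.bin) -> later work.
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/feature
echo base > readme.txt && git add . && git commit -q -m "base"
echo junk > big.bin && echo notes > notes.txt && git add . && git commit -q -m "dirty"
echo code > code.txt && git add . && git commit -q -m "later work"
dirty=$(git rev-parse HEAD~1) later=$(git rev-parse HEAD)
cd ..
git clone -q --bare seed main.git

# A teammate's clone with one commit that never got pushed.
git clone -q main.git stale
( cd stale && echo wip > work.txt && git add . && git commit -q -m "unpushed work" )

# The clean: rebuild history without big.bin and force-push it.
cd seed
git reset -q --hard HEAD~2
git checkout -q "$dirty" -- notes.txt && git commit -q -C "$dirty"
git cherry-pick "$later" >/dev/null
git push -q --force ../main.git feature
cd ..

# In the old clone: keep the old tracking refs under a new name, fetch the
# cleaned history under another, then replay only the unpushed commit.
cd stale
git remote rename origin unclean-origin
git remote add clean-origin "$demo/main.git"
git fetch -q clean-origin
git rebase -q --onto clean-origin/feature unclean-origin/feature feature
git log --oneline feature
```

After the rebase, `feature` holds the three cleaned commits plus "unpushed work", and `big.bin` is gone from its tree. The old commits are still sitting in that clone, though (under `unclean-origin` and in the reflog), so this has to be repeated for every local branch, and nothing based on the old history must ever be pushed.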

    Conclusion

    You know your team: are you sure they can all perform esoteric Git rebasing operations flawlessly? And what is the benefit if they do? After all is said and done, isn't it easier just to tell them to delete their old repo and re-clone?