I was going to ask this on the repository, but SO seemed like a more fitting place.
I was able to use BFG Repo Cleaner (great tool, thank you!) to reduce our .git folder size by over 1GB, which is a smashing success as far as our repository is concerned. I have not pushed my bare clone to remote yet, as I am concerned with putting forward these changes before understanding the consequences of pushing and then not re-cloning.
I understand that best practice dictates that when history has changed in this way, the best solution is to perform a fresh clone. However, I work with a team of over 50 people in a repository of over 2GB and 23k commits, and cross-team coordination can be incredibly difficult under our structure. As a result, I have some questions:
Thanks again for creating such a handy tool, and hopefully I can finish making it useful for my team's project. I will continue to experiment on my fork in the meantime.
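Before we get into this, let me clarify the recommended process for cleaning Git history in the context of an active team of developers (no matter what technology is used for the cleaning, whether BFG Repo-Cleaner or git filter-branch):
1. Tell the team that the history cleaning is going to take place, and that they need to push all their work to the main repository before it starts.
2. Clean the repository history (using the BFG or git filter-branch), and use git gc to trim the dead objects.
3. Push the cleaned history back up to the main repository (if you're using a mirror clone, all the old branches/tags will be overwritten to the new cleaned history).
4. Have every team member delete their old copy of the repository and re-clone from the cleaned main repository.
So, to your questions: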
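Before the individual questions, here is a rough sketch of the cleaning and pushing steps of that process, assuming the BFG is used. The function name `clean_history`, the directory name `repo-to-clean.git`, and the `100M` threshold are illustrative assumptions, not values from the question; use whatever BFG options you actually ran.

```shell
#!/bin/sh
# Sketch only: clean_history, repo-to-clean.git and the 100M threshold are
# hypothetical placeholders chosen for illustration.

# usage: clean_history <url-of-main-repo> <path-to-bfg-jar>
clean_history() {
    repo_url=$1
    bfg_jar=$2

    # A mirror clone carries every branch and tag, so the later push
    # overwrites all of them with the cleaned history.
    git clone --mirror "$repo_url" repo-to-clean.git || return 1

    # Rewrite history with the BFG (example option: drop large blobs).
    java -jar "$bfg_jar" --strip-blobs-bigger-than 100M repo-to-clean.git || return 1

    (
        cd repo-to-clean.git || exit 1
        # Expire reflogs and garbage-collect so the dead objects are removed,
        # then push the rewritten refs back to the main repository.
        git reflog expire --expire=now --all &&
        git gc --prune=now --aggressive &&
        git push
    )
}
```

Nothing in this sketch coordinates the team; the notification before and the re-clone after still have to happen around it.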
What would the consequences be if I were to push these changed refs and people were to pull to their existing copy rather than create a fresh clone?
Bad. From experience I can say there will be a mess and people will get confused and upset.
Specifically, what happens on that person's machine is that the git pull command will merge together the old dirty history and the new cleaned history, with two long divergent histories (diverging initially with the first 'dirty' commit in your history, which in your case was 3 years ago) being joined together with one brand new and very confusing merge commit. It's seldom clear to users that this has happened - most Git log visualisers will not render this in a way likely to make it apparent - if you're lucky a user might say something like "I've got two copies of every commit now, WTF?!" - but only if they're really observant.
If that user later makes some new commits, and pushes back up to the main repository, they will have pushed the dirty history back up to the cleaned main repository, negating your work, making your history dirty again, and creating a very confusing Git history which all your other users will become exposed to next time they pull from the main Git repo.
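One way a user could check a local clone for this before pushing, sketched under the assumption that you noted the SHA of at least one commit that the cleaning rewrote (every commit from the first dirty one onwards gets a new SHA, so any old SHA will do). The function name `branches_with_dirty_history` is hypothetical.

```shell
#!/bin/sh
# Sketch only: branches_with_dirty_history is a hypothetical helper name.

# usage: branches_with_dirty_history <sha-of-a-commit-from-the-old-history>
# Prints every local branch whose history still contains that commit.
branches_with_dirty_history() {
    git for-each-ref --contains "$1" --format='%(refname:short)' refs/heads
}
```

An empty result for a known old SHA suggests the local branches are built only on the cleaned history; any branch name printed should not be pushed as it stands.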
Would they need to do anything else to mitigate these consequences as part of, or in addition to their pull, if this is feasible?
Technically, yes. In practice, the procedure is complex, error-prone, and if just one user gets it wrong, you are screwed just like before.
At this point, we have to work out why you're trying to dodge this procedure. Is it because:
Does this recommendation change at all if you consider that the blobs that were deleted are from history that is at least a year old and at most three years old?
If the bad stuff has been committed very recently, and no other users have pulled it yet (so, within the last few hours or minutes), you could possibly get away with quickly cleaning history on the main repo before anyone else pulls it. As soon as anyone else pulls dirty data, they need to be decontaminated, and the easiest way to do that is to delete and re-clone.
If the bad stuff was committed years ago, then everyone has it, and they all need to be decontaminated.
Finally, given that a new clone would not include any work not synced upstream, do you have a recommendation on the best way to carry over untracked branches from one clone to another?
The recommended way to deal with this problem is to make sure it does not happen. Communicate with your team, tell them that the repository cleaning is going to take place, and all they have to do to make it work is make sure they've pushed all their work up on any branch to the main repository before you start the cleaning.
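To help each team member confirm that nothing is left behind, a small check along these lines could be run in their clone before the cleaning starts. This is a sketch: the function name `unpushed_commits` is hypothetical, and it only sees commits, not stashes or uncommitted changes.

```shell
#!/bin/sh
# Sketch only: unpushed_commits is a hypothetical helper name.

# Lists commits reachable from local branches but from no remote-tracking
# branch, i.e. work that exists only in this clone.
unpushed_commits() {
    git log --branches --not --remotes --oneline
}
```

Running `git fetch` first keeps the remote-tracking branches current; empty output then means every local commit is already on some branch of the main repository.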
If someone doesn't do this, they can try rebasing the branches they care about onto the cleaned history. For each feature branch, something like:
$ git rebase --onto clean-origin/feature unclean-origin/feature feature
...(which loosely translates to "take all the commits that are on my feature branch, that I didn't push to the main repo before it was cleaned, and replay them on top of the main repo's cleaned version of that branch").
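For that command to work, the clone needs both the old and the new remote-tracking branches side by side. One possible way to set that up is sketched below; it assumes the existing remote is called `origin`, and the function name `rebase_onto_clean` and its arguments are hypothetical. `clean-origin` and `unclean-origin` are simply the placeholder names from the command above.

```shell
#!/bin/sh
# Sketch only: rebase_onto_clean is a hypothetical helper, and the existing
# remote is assumed to be named "origin".

# usage: rebase_onto_clean <url-of-cleaned-main-repo> <branch>
rebase_onto_clean() {
    clean_url=$1
    branch=$2

    # Keep the old remote-tracking refs, now under unclean-origin/*.
    # Do not fetch from unclean-origin afterwards, or they get overwritten.
    git remote rename origin unclean-origin || return 1

    # Fetch the cleaned history under a separate remote name.
    git remote add clean-origin "$clean_url" || return 1
    git fetch clean-origin || return 1

    # Replay only the never-pushed commits onto the cleaned branch.
    git rebase --onto "clean-origin/$branch" "unclean-origin/$branch" "$branch"
}
```

This has to be repeated for every branch with unpushed work, and the rebase can still stop on conflicts that need resolving by hand.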
If the user gets this wrong, or forgets to do it for just one branch, you will be back to the bad mixed dirty/clean history scenario.
You know your team: are you sure they can all perform esoteric Git rebasing operations flawlessly? And what is the benefit if they do? After all is said and done, isn't it easier just to tell them to delete their old repo and re-clone?