git git-rebase git-filter-branch git-history

Impact of editing git history

We've got a pretty bad situation with the state of one of our repos. Someone carelessly committed 4 GB of binary files to the repo and pushed it to the remote master. Then, they said "Oops!" and reverted that commit.

Unfortunately, git only stores the diff, and because it can't really store the diff of binary files, it stores the entire file in the history. And because it was included twice in the history (once when it was added, once when it was removed) the repo is now 8 GB in size. This causes huge problems for us and makes our builds take about an hour longer than they need to.

I understand that I can use tools like rebase and filter-branch to either remove these commits or remove these files from the git history. However, every single post or documentation on these tools says "If the commits you want to edit have already been pushed to remote, then DON'T DO IT! Rewriting history is a BAD IDEA!!!"

But nowhere is it actually explained what is the impact of rewriting history. We really have no choice here - we've GOT to remove these files from the history. But, with all of the dire warnings about the dangers of rewriting git history, we are very scared to actually attempt to remove these files.

So, I am hoping that a helpful StackOverflow user could explain what could possibly be the impact of removing these huge files using filter-branch, or maybe if there is some better solution that we aren't aware of.

Solution

It's a common misconception that git stores diffs. It actually stores the full contents of every version of every file*. The entire model of git is built around content-addressed objects, which lets it guarantee bit-perfect retrieval of what was committed, something that is much harder to guarantee in a purely diff-based VCS.

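Removing a file doesn't store it a second time: the revert commit simply records a tree without the file, and identical content is only ever stored once. So you've probably either got two commits that each added different binaries, or you're counting both the copy in the object database and the one in the working directory.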
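To make the "stored once" point concrete, here is a minimal Python sketch of content addressing. It assumes SHA-1 object IDs (git's default; SHA-256 repositories also exist), and `store` and `add_blob` are hypothetical names for a toy in-memory object database, not anything git exposes.

```python
import hashlib

def git_blob_id(content):
    """Return the SHA-1 object ID git computes for a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Toy object database (hypothetical): object ID -> full file content.
store = {}

def add_blob(content):
    """Store the full content under its ID; identical content reuses the key."""
    oid = git_blob_id(content)
    store[oid] = content
    return oid

big_file = b"\x00\x01" * 1024  # stand-in for the large binary

first = add_blob(big_file)   # the commit that adds the file writes one blob
# The revert commit only records a tree without the file: no new blob.
second = add_blob(big_file)  # even re-adding identical content reuses the blob

print(first == second, len(store))     # True 1
print(git_blob_id(b"hello world\n"))   # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The last line should match what `echo 'hello world' | git hash-object --stdin` prints in a SHA-1 repository.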

To answer your core question, though:

Git stores data as a collection of objects that reference each other by hash (see Merkle trees). Because both the trees and the history are built of objects that reference other objects, it's very difficult to truly eliminate shared data from a git repo.
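A rough Python sketch of why this matters for history: each commit's ID is derived from its tree and its parents' IDs, so changing or dropping an early commit changes the ID of every commit after it. The commit format here is deliberately simplified (real commits also carry author, committer and timestamps), and `commit_id`, `build_history` and the tree names are hypothetical.

```python
import hashlib

def commit_id(tree, parents, message):
    """Hash a simplified commit: tree, parent IDs and message only."""
    lines = ["tree " + tree] + ["parent " + p for p in parents] + ["", message]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()

def build_history(steps):
    """Chain (tree, message) pairs into a linear history; return the commit IDs."""
    ids = []
    parents = []
    for tree, message in steps:
        cid = commit_id(tree, parents, message)
        ids.append(cid)
        parents = [cid]  # the next commit points back at this one
    return ids

old = build_history([
    ("tree-0", "initial"),
    ("tree-with-binaries", "add binaries"),
    ("tree-0", "revert"),
    ("tree-1", "fix bug"),
])

# Same final content, but with the two offending commits removed.
new = build_history([
    ("tree-0", "initial"),
    ("tree-1", "fix bug"),
])

print(old[0] == new[0])    # the untouched root commit keeps its ID
print(old[-1] == new[-1])  # "fix bug" gets a different ID: its ancestry changed
```

The "fix bug" commit has the same tree and message in both histories, yet a different ID, because its parent differs.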

"Rewriting history" is even a bit of a misnomer, as git never re-writes history, it just goes back and creates a new history, then points to that new history instead. The old stuff can hang about for months before garbage collection. Once you start to share that, in git's logical model, your re-written history is just another branch on another instance of the repo.

Normally, branches move the codebase forward, and can be merged to bring that history together. If you have a feature branch called feature1 and merge it into your master branch, it isn't just the code that becomes part of master: all the commits on feature1 become part of master as well. When each branch is a discrete bit of code, this isn't an issue.

It does become an issue when you try to rewrite history. Let's say you do what you're suggesting, and remove the files from the history using filter-branch (though a rebase would be easier and probably safer if the commits are fairly recent). Every member of your team deletes their local copy of that branch and checks out the new one. Everything is great, except that you were working on featureX and had already merged the master branch into it after the mistake happened, so the old master is part of your featureX branch. A diff between featureX and the new master will show the same results as a diff between featureX and the old master, but all those old commits are still part of featureX. As far as git is concerned, featureX still contains the commit where the large files were added, and when you merge it into master, featureX brings everything back in with it.

So that's the danger: if even one person, somewhere on any of their branches, still has a copy of the old commit in the history, you'll end up not only still having the files you're trying to get rid of, but a whole second version of the history to deal with as well.
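The featureX scenario can be sketched as a toy commit graph in Python. Everything here is hypothetical (the commit names, the `graph` dict and the `reachable` helper); it only models which commits are reachable from a branch tip, which is what decides whether git keeps the big blobs around.

```python
def reachable(graph, tip):
    """Return every commit reachable from `tip` by following parent links."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(graph[commit])
    return seen

# commit -> list of parent commits
graph = {
    "A": [],                     # last good commit
    "B_big": ["A"],              # old master: adds the 4 GB of binaries
    "C_revert": ["B_big"],       # old master: reverts them
    "F1": ["A"],                 # work on featureX
    "F2": ["F1", "C_revert"],    # featureX merged the OLD master
    "C_clean": ["A"],            # rewritten master, binaries never added
}

# Right after the cleanup, the new master is free of the big commit.
before = reachable(graph, "C_clean")
print("B_big" in before)  # False

# Someone merges the stale featureX into the new master.
graph["M"] = ["C_clean", "F2"]
after = reachable(graph, "M")
print("B_big" in after)   # True: the old history is back
```

The merge commit's tree may look perfectly clean, but the old commits are reachable again, so the large objects can never be garbage collected.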

If you must remove it, it can be done, but you'll have to very carefully coordinate the process to ensure that every instance of the repository has been cleaned. With a very small team, this isn't horrible, but the bigger and more distributed your team is, the harder it gets.

*It does do some clever delta-compression stuff when it packs objects for storage, but always in a way that guarantees a bit-perfect reconstruction. Git will detect even one bit out of place in the entire history as a broken repo.