git github binaryfiles merge-conflict-resolution git-merge-conflict

Keep both binary files when there is a merge conflict in Git

Disclaimer

I realize I'm using git for something it's not really designed for. But I'm so close to getting it to do what I want. If you have a better idea let me know...

TL;DR

I want to keep both binary files when there is a merge conflict. I've seen an answer here, but I don't think it addresses my specific problem; at least, I couldn't figure out how it's actually accomplished.

The problem

I have hundreds of small (~4 KB) binary files on a single master branch - each one is a sheet music file. Each piece of music needs to go through various stages before it's complete: formatting, adding chords, fixing lyrics, revised by person #1, revised by person #2, etc. With some batch files I can simply and easily both write and parse commit messages to generate a sort of report. Git seems to be a great solution to programmatically track the status of each song. It's also very important that I keep the entire history of changes for each song and be able to look through the history easily (TortoiseGit enables me to do this - right click the file and choose "git show log").

Every time a file is modified, the change is committed (i.e. each commit signifies a single changed file). So let's say I've got two songs, A and B (there are actually more than 400). There are multiple commits signifying changes in song A, and multiple commits signifying changes in song B, and the changes are spread out over the entire master branch like this:

A₁ - A₂ - B₁ - A₃ - B₂

Now let's say a user makes changes to both songs A and B and pushes it up to the remote, but I am also working on songs A and B and try to pull in changes on top of mine, like this:

Remote:
A₁ - A₂ - B₁ - A₃ - B₂ - Their B₃ - Their A₄
                            ||         ||
                            V          V
Local:
A₁ - A₂ - B₁ - A₃ - B₂ - My B₃    - My A₄

Classic merge conflict scenario, right?

How can I end up with something like the following?

Remote:
A₁ - A₂ - B₁ - A₃ - B₂ - Their B₃ - My B₃ - Their A₄ - My A₄

Local:
A₁ - A₂ - B₁ - A₃ - B₂ - Their B₃ - My B₃ - Their A₄ - My A₄

I've tried all the combinations and possible solutions that I could find on the web (I've learned a lot about git in the process) but can't seem to crack this one. Any help from programming wizards like yourselves is appreciated. I hope the question is clear enough.

Solution

Let's cover some background information first.

Git commits store snapshots of all of your files.¹ That is, each commit has a full copy of the bytes that make up each file that is stored inside that commit. The files inside any one particular commit have names with embedded slashes, such as path/to/file.ext. The copies in each commit are stored in a special, read-only, Git-only format. The copies in the commits are de-duplicated, so that if you make a new commit that just re-uses some previous files, you don't actually end up with a new copy; this is made possible by the fact that it is literally impossible to change the files once they're stored. But most programs on your computer can't use the internal Git-only files, so to use the files, you must extract them, and that's what git checkout or (since Git 2.23) git switch does for you.
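If you want to see the de-duplication for yourself, here is a small sketch. It builds a throwaway repository (the file contents, commit messages, and identity settings are all made up for illustration), changes only A, and then compares the blob hash IDs that the two commits record for A and B:

```shell
#!/bin/sh
# Sketch: an unchanged file is shared between commits, not copied.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git config commit.gpgsign false

# \000 puts a NUL byte in each file so Git treats it as binary.
printf 'song A\000first draft' > A
printf 'song B\000first draft' > B
git add A B
git commit -q -m "Add A and B"

printf 'song A\000second, longer draft' > A   # only A changes
git add A
git commit -q -m "Update A"

# <commit>:<path> names the blob stored for that path in that commit.
a_old=$(git rev-parse HEAD~1:A)
a_new=$(git rev-parse HEAD:A)
b_old=$(git rev-parse HEAD~1:B)
b_new=$(git rev-parse HEAD:B)
echo "A: $a_old -> $a_new"
echo "B: $b_old -> $b_new"
```

The two hash IDs printed for B should be identical (both commits share one stored object), while the two for A should differ.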

Your computer will, in general, hold these files in directories (or folders, if you prefer that term), so that you end up with a directory named path containing a directory named to, which in turn contains file.ext. Some computers have some issues with file-name case, e.g., they cannot store both README.TXT and readme.txt. But even Linux systems, which don't normally have this issue, literally cannot store both your B and their B under one name B. Likewise, Git can't use one name to store two different files inside one commit: each file inside any given commit has to have its own unique name.

Initially, that's not a problem: you and your colleagues only have one A and one B and so on. If only one person changes it, you and they pick up the updated A or B, and the updated file goes into the newer commit. The work-tree copy, the one your computer can use, is just one copy; when you make a new commit, the new commit stores a new (or re-uses the old) frozen, Git-ified version, as appropriate. But when you both change one of these ... well, this brings us to the merge and the merge conflict.

¹That's the data part of a commit. Commits also store metadata, which is information about the commit itself, such as who made it, when, and why: your log message. In the metadata, each commit stores the hash ID of its parent commit, too. For merge commits, the commit stores the IDs (plural) of each parent. But here, we're mostly interested in just the data.

Merging finds differences

When you run:

git checkout somebranch       # or git switch somebranch
git merge otherbranch

Git finds the best common ancestor commit (the merge base), then runs two git diffs to see what files you changed (and how), and what files they changed (and how). For text files, Git turns the "and how" part into a textual difference, then attempts to combine the two differences. Since your files are binary, Git can't do that: it does not attempt a content-level merge at all, and simply reports a conflict, leaving your version of the file in the work-tree. :-)
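You can approximate what the merge looks at by hand. The sketch below is not what Git runs internally, just the equivalent user-level commands; the repository, file contents, and commit messages are invented for the example, and whatever branch git init creates stands in for somebranch:

```shell
#!/bin/sh
# Sketch: find the merge base, then list what each side changed since it.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git config commit.gpgsign false
ours_branch=$(git symbolic-ref --short HEAD)   # plays the role of somebranch

printf 'A\000base' > A
printf 'B\000base' > B
git add A B
git commit -q -m "Base versions of A and B"

git checkout -q -b otherbranch
printf 'B\000their edit' > B
git commit -q -am "Their B"

git checkout -q "$ours_branch"
printf 'B\000my own edit' > B
git commit -q -am "My B"
printf 'A\000my own edit' > A
git commit -q -am "My A"

base=$(git merge-base HEAD otherbranch)
ours_changed=$(git diff --name-only "$base" HEAD)
theirs_changed=$(git diff --name-only "$base" otherbranch)
echo "merge base:  " "$base"
echo "we changed:  " $ours_changed      # unquoted on purpose: one line
echo "they changed:" $theirs_changed
```

Only B shows up in both lists, so B is the only file a real git merge otherbranch would have to combine, and the only one that would conflict here.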

Your work-tree, alas, can only hold one copy of file B. But Git isn't really using your work-tree copy. That's just for you. Git is using committed copies, in the frozen format, internally. Git stores these in what Git calls its index, or staging area, or (rarely these days) cache. Normally, the index holds only one frozen-format copy of any one file. But during a merge conflict, Git expands the index.

At this point, Git has not one, not two, but three active copies of file B in its index: the one that you both shared from the merge base, plus your version and their version from the two commits you're trying to merge. Having these three different copies recorded under the one name B is how Git represents the merge conflict.
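To look at those three copies directly, git ls-files -u lists the unmerged index entries, and the :1:B, :2:B and :3:B syntax names the merge-base, "ours" and "theirs" entries respectively. Here is a sketch that provokes the conflict in a scratch repository (contents and names are made up) and checks which commit each stage came from:

```shell
#!/bin/sh
# Sketch: inspect the three index entries for a conflicted binary file.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git config commit.gpgsign false
ours_branch=$(git symbolic-ref --short HEAD)

printf 'B\000base' > B
git add B
git commit -q -m "Base version of B"

git checkout -q -b otherbranch
printf 'B\000their edit' > B
git commit -q -am "Their B"

git checkout -q "$ours_branch"
printf 'B\000my own edit' > B
git commit -q -am "My B"

# The merge stops with a conflict, so its non-zero exit is expected.
git merge --no-edit otherbranch >/dev/null 2>&1 || true

git ls-files -u B          # one line per stage: mode, hash, stage, path
stage_count=$(git ls-files -u B | wc -l)
base=$(git merge-base HEAD otherbranch)

[ "$(git rev-parse :1:B)" = "$(git rev-parse "$base":B)" ] &&
    echo "stage 1 is the merge-base copy"
[ "$(git rev-parse :2:B)" = "$(git rev-parse HEAD:B)" ] &&
    echo "stage 2 is our copy"
[ "$(git rev-parse :3:B)" = "$(git rev-parse otherbranch:B)" ] &&
    echo "stage 3 is their copy"
```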

Your job, at this point, is to arrange for the index to hold only one copy of each file, ready to be committed.

You will have to rename at least one file

So you now have a merge conflict, with three copies of some binary file—I will keep calling it B—that were in the three commits, now copied into Git's index.² Git gives you a convenient way to extract two of those three copies:

git checkout --ours B

and:

git checkout --theirs B

These two commands copy the two non-merge-base copies of the file from the index, to your work-tree. So:

git checkout --ours B; mv B "My B"; git add "My B"
git checkout --theirs B; mv B "Their B"; git add "Their B"
git rm --cached B

will tell Git to extract your version of B first, which you then rename and add as My B. Then you have Git extract their version of B, rename it, and add it. Finally, you tell Git that the correct way to resolve the conflict between the three versions of file B is to remove file B.

The new commit you make from this will have, as part of its snapshot, the two renamed B copies, and no original B at all.
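Putting it all together, here is an end-to-end sketch you can run in a scratch directory. Everything about the setup (file contents, commit messages, the identity settings) is invented for the example; the resolution commands in the middle are the same ones shown above:

```shell
#!/bin/sh
# Sketch: reproduce the binary conflict, then keep both versions of B.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git config commit.gpgsign false
ours_branch=$(git symbolic-ref --short HEAD)

printf 'B\000base' > B
git add B
git commit -q -m "Base version of B"

git checkout -q -b otherbranch
printf 'B\000their edit' > B
git commit -q -am "Their B"

git checkout -q "$ours_branch"
printf 'B\000my own edit' > B
git commit -q -am "My B"

mine=$(git rev-parse HEAD:B)            # blob IDs, for checking later
theirs=$(git rev-parse otherbranch:B)

# The merge stops with a conflict, so its non-zero exit is expected.
git merge --no-edit otherbranch >/dev/null 2>&1 || true

# Resolve: keep each side under its own name, drop the shared name.
git checkout --ours B;   mv B "My B";    git add "My B"
git checkout --theirs B; mv B "Their B"; git add "Their B"
git rm --cached B
git commit -q -m "Merge otherbranch, keeping both versions of B"

git ls-tree --name-only HEAD            # lists the files in the new snapshot
```

The final listing should show My B and Their B and no plain B. Since the de-duplicated blobs are re-used, "My B" in the merge commit has the same hash ID as B did in your commit, and likewise for "Their B".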

²Technically, what's in Git's index is not actually a copy. It's the hash ID of the de-duplicated frozen-format blob object that Git uses. But you can just think of it as a copy; it works like one.