Cherry-picking a commit: is a commit a snapshot or a patch?

I have a question related to cherry-picking commits and conflicts.

The 'Pro Git' book explains that commits are a kind of snapshot, not patches/diffs.

But cherry-picking a commit may behave as if it were a patch.

Example below, in short:

create 3 commits, each time editing the first (and only) line of the file
reset the branch to the first commit
test 1: try to cherry-pick the third commit (conflict)
test 2: try to cherry-pick the second commit (OK)

mkdir gitlearn
cd gitlearn

touch file
git init
Initialized empty Git repository in /root/gitlearn/.git/

git add file

#fill file by single 'A'
echo A > file && cat file
A

git commit file -m A
[master (root-commit) 9d5dd4d] A
 1 file changed, 1 insertion(+)
 create mode 100644 file

#fill file by single 'B'
echo B > file && cat file
B

git commit file -m B
[master 28ad28f] B
 1 file changed, 1 insertion(+), 1 deletion(-)

#fill file by single 'C'
echo C > file && cat file
C

git commit file -m C
[master c90c5c8] C
 1 file changed, 1 insertion(+), 1 deletion(-)

git log --oneline
c90c5c8 C
28ad28f B
9d5dd4d A

test 1

#reset the branch to 9d5dd4d ('A' version)
git reset --hard HEAD~2
HEAD is now at 9d5dd4d A

git log --oneline
9d5dd4d A

#cherry-pick 'C' version over 'A'
git cherry-pick c90c5c8
error: could not apply c90c5c8... C
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'

#the conflict:
cat file
<<<<<<< HEAD
A
=======
C
>>>>>>> c90c5c8... C

test 2

#same for 'B' - succeeds
git reset --hard HEAD
HEAD is now at 9d5dd4d A

git cherry-pick 28ad28f
[master eb27a49] B
 1 file changed, 1 insertion(+), 1 deletion(-)

Please explain why test 1 failed (I could imagine the answer if commits were patches, but snapshots?)

Solution

The Pro Git book is correct: a commit is a snapshot.

You are also correct, though: git cherry-pick applies a patch. (Well, sort of: see further details below.)

How can this be? The answer is that when you cherry-pick a commit, you also specify which parent commit to consider, with the -m parent-number argument. The cherry-pick command then generates a diff against that parent, so that the resulting diff can be applied now.

Should you choose to cherry-pick a non-merge commit, there is only one parent, so you don't actually pass -m and the command uses the (single) parent to generate the diff. But the commit itself is still a snapshot, and it's the cherry-pick command that finds the diff of commit^1 (the first and only parent) vs commit and applies that.
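To make this concrete for the example in the question, here is a minimal sketch that rebuilds the same three commits in a scratch repository and prints the diff each cherry-pick would derive. The temporary directory, the user name/email, and the variable names are assumptions for illustration only.

```shell
# Scratch repository mirroring the question: three commits A, B, C,
# each replacing the single line of "file".
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

for v in A B C; do
    echo "$v" > file
    git add file
    git commit -q -m "$v"
done

# For a non-merge commit, the "patch" is the diff from its only parent
# to the commit itself. Keep just the removed/added lines, dropping the
# "---" and "+++" file headers.
patch_c=$(git --no-pager diff HEAD~1 HEAD -- file | grep '^[-+][^-+]')
patch_b=$(git --no-pager diff HEAD~2 HEAD~1 -- file | grep '^[-+][^-+]')

echo "commit C as a patch:"; echo "$patch_c"
echo "commit B as a patch:"; echo "$patch_b"
```

Commit C, viewed as a patch, says "replace B with C", which has nothing to match in a file that contains A; commit B says "replace A with B", which fits. That is the asymmetry between test 1 and test 2.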

Optional reading: It's not just a patch

Technically, git cherry-pick does a full-blown three-way merge, using Git's merge machinery. To understand why there's a distinction here, and what it is, we have to get a bit into the weeds of diffs, patches, and merges.

A diff between two files—or two snapshots of many files—produces a sort of recipe. Following the instructions won't bake you a cake (there is no flour, eggs, butter, and so on). Instead, it will take the "before" or "left hand side" file, or set of files, and produce as its result the "after" or "right hand side" file, or set of files. The instructions, then, include steps like "add a line after line 30" or "remove three lines at line 45".

The precise set of instructions generated by some diff algorithm depends on that algorithm. Git's simplest diffs use only two: delete some existing line(s) and add some new line(s) after some given starting point. That's not quite sufficient for new files and deleted files, so we can add delete file F1 and create all-new-file F2. Or, in some cases, we might replace a delete-file-F1-create-F2-instead with rename F1 to F2, optionally with additional changes. Git's most complicated diffs use all of these.¹

This gives us a simple set of definitions that applies not only to Git, but also to many other systems. In fact, before Git there were diff and patch. See also the Wikipedia article on patch. A very brief summary definition of the two goes like this, though:

diff: a comparison of two or more files.
patch: a diff that is machine-readable and suitable for machine-applying.
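The two definitions can be tried out with a short sketch: produce a diff between two snapshots, save it, and then let a program apply it to the "before" version. This uses git diff and git apply in a scratch repository (the classic diff and patch programs play the same roles outside Git); the file name recipe.txt, the patch file name, and the identity settings are made up for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# "before" snapshot
printf 'one\ntwo\nthree\n' > recipe.txt
git add recipe.txt
git commit -q -m before

# "after" snapshot: one line changed, one line added
printf 'one\n2\nthree\nfour\n' > recipe.txt
git commit -q -am after

# diff: a comparison of the two snapshots, saved in machine-readable form
git diff HEAD~1 HEAD > change.patch

# patch: put the "before" content back, then apply the recipe to it
git checkout -q HEAD~1 -- recipe.txt
git apply change.patch

result=$(cat recipe.txt)
echo "$result"
```

After git apply, recipe.txt again holds the "after" content, reconstructed purely from the "before" file plus the instructions in change.patch.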

These are useful outside version control systems, which is why they predate Git (though not, technically, version control, which dates back to the 1950s for computing, and probably thousands of years when generalized: I'll bet there were multiple different sketches for, say, the Lighthouse at Alexandria, or the Pyramid of Djoser). But we can have issues with a patch. Suppose someone has Version 1 of some program, and makes a patch for a problem with it. Later, we discover the same problem in Version 5. The patch may well not apply at this point, because the code has moved around: certainly within the file, and possibly even to different files. The context may have changed as well.

Larry Wall's patch program handled this using what it called offsetting and fuzz. See Why does this patch applied with a fuzz of 1, and fail with fuzz of 0? (This is very different from "fuzzing" in modern software testing.) But in a true version control system, we can do better—sometimes a great deal better. This is where the three way merge comes in.

Suppose we have some software, with multiple versions in the repository R. Each version V_i consists of some set of files. Doing a diff from V_i to V_j produces a (machine-readable, i.e., patch) recipe for turning version i into version j. This works regardless of the relative directions of i and j, i.e., we can go "back in time" to an older version when j ≺ i (the funky curly less-than is a precedes symbol, which allows for Git-style hash IDs as well as simple numeric versions like SVN's).

Now suppose that we have our patch p made by comparing V_i vs V_j. We'd like to apply patch p to some third version, V_k. What we need to know is this:

For each patch's change (and assuming that changes are "line oriented", as they are here):
- What file name in V_k corresponds to the file-pair in V_i vs V_j for this change? That is, perhaps we're fixing some function f(), but in versions i and j function f() is in file file1.ext and in version k it's in file file2.ext.
- What lines in V_k correspond to the changed lines? That is, even if f() didn't switch files, maybe it's been moved up or down a lot by a large deletion or insertion above f().

There are two ways to get this information. We can either compare V_i to V_k, or compare V_j to V_k. Both of these will get us the answers we need (although the precise details for using the answers will differ somewhat in some cases). If we choose—as Git does—to compare V_i to V_k, that gives us two diffs.
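A small sketch of this idea, using git merge-file (which runs a three-way merge on three plain files and needs no repository). The file names base.txt, fixed.txt and ours.txt stand in for V_i, V_j and V_k and are assumptions for illustration.

```shell
work=$(mktemp -d)
cd "$work"

# V_i: the version the fix was written against
printf 'a\nb\nc\nold f()\nd\n' > base.txt
# V_j: the same version with the fix applied
printf 'a\nb\nc\nnew f()\nd\n' > fixed.txt
# V_k: a third version in which three unrelated lines were inserted above f()
printf 'x\ny\nz\na\nb\nc\nold f()\nd\n' > ours.txt

# Argument order is <current> <base> <other>; -p prints the result
# instead of overwriting ours.txt.
merged=$(git merge-file -p ours.txt base.txt fixed.txt)
echo "$merged"
```

The fix lands on line 7 of the result rather than line 4, because comparing base.txt with ours.txt tells the merge where the original line 4 now lives.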

¹Git's diff also has a "find copies" option, but it's not used in merge and cherry-pick, and I've never found it useful myself. I think it's a bit deficient internally, i.e., this is an area that—at least someday—needs more work.

Regular merging

Now we make one more observation: In a normal true Git merge, we have a setup like this:

          I--J   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2

where each uppercase letter represents a commit. Branch names br1 and br2 select commits J and L respectively, and the history working backwards from these two branch-tip commits comes together—joins up—at commit H, which is on both branches.

To perform git merge br2, Git finds all three of these commits. It then runs two git diffs: one compares H vs J, to see what we changed in branch br1, and the second compares H vs L, to see what they changed in branch br2. Git then combines the changes and, if this combining is successful, makes a new merge commit M, starting with the files in H, that:

preserves our changes, but also
adds their changes

and is therefore the correct merge result. Commit M looks like this in the graph:

          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L   <-- br2

but it's the snapshot in M that matters more to us at the moment: the snapshot in M keeps our changes, i.e., has everything we did in br1, and adds their changes, i.e., acquires whatever feature or bug-fixes occurred in commits K and L.
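The steps of a regular merge can be sketched in a scratch repository. The branch names br1 and br2 follow the diagram; the file contents, identity settings and variable names are invented for the example, and each branch gets a single commit (J and L) to keep it short.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Commit H: the shared starting point
printf 'a\nb\nc\nd\ne\n' > file.txt
git add file.txt
git commit -q -m H
git branch br2
git checkout -q -b br1

# Commit J on br1: we change the first line
printf 'A\nb\nc\nd\ne\n' > file.txt
git commit -q -am J

# Commit L on br2: they change the last line
git checkout -q br2
printf 'a\nb\nc\nd\nE\n' > file.txt
git commit -q -am L
git checkout -q br1

# The merge base is H; the two diffs are what Git combines
base=$(git merge-base br1 br2)
git --no-pager diff "$base" br1    # what we changed
git --no-pager diff "$base" br2    # what they changed

git merge -q --no-edit br2
merged=$(cat file.txt)
second_parent=$(git rev-parse HEAD^2)
echo "$merged"
```

The merged file carries both the change to the first line and the change to the last line, and the new commit has br2's tip as its second parent.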

Cherry-picking

Our situation is a bit different. We have:

...--P--C--...   <-- somebranch

We also have:

...--K--L   <-- ourbranch (HEAD)

where the ... part might join up with somebranch before the P-C parent/child commit pair, or might join up after the P-C commit pair, or whatever. That is, both of these are valid, though the former tends to be more common:

...--P--C--...   <-- somebranch
   \
    ...--K--L   <-- ourbranch (HEAD)

and:

...--P--C--...   <-- somebranch
             \
              ...--K--L   <-- ourbranch (HEAD)

(In the second example, any changes made in P-vs-C are normally already in both K and L, which is why it's less common. However, it's possible that someone reverted commit C in one of the ... sections, on purpose or even by mistake. For whatever reason, we now want those changes again.)

Running git cherry-pick doesn't just compare P-vs-C. It does indeed do that—this produces the diff / patch we want—but it then goes on to compare P vs L. Commit P is thus the merge base in a git merge style comparison.

The diff from P to L means, in effect, keep all our differences. As with the H-vs-J comparison in a true merge, we'll keep all our changes in the final commit. So a new "merge" commit M will have our changes. But Git will add to this the changes in P-vs-C, so we'll pick up the patch changes as well.

The diff from P to L also provides the necessary information about which file function f() has moved to, if it has moved, and about any offset needed for patching f(). So by using the merge machinery, Git gains the ability to apply the patch to the correct line(s) of the correct file(s).
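Applied to the two tests in the question, the same reasoning can be sketched with git merge-file, feeding it the three file versions each cherry-pick would see. The file names A.txt, B.txt and C.txt and the variables are assumptions for illustration; the real cherry-pick works on commits, not loose files.

```shell
work=$(mktemp -d)
cd "$work"
echo A > A.txt
echo B > B.txt
echo C > C.txt

# Test 1: cherry-pick C onto A. The merge base is C's parent, B.
#   base -> ours   (B -> A) changes line 1
#   base -> theirs (B -> C) changes line 1 in a different way
if git merge-file -p A.txt B.txt C.txt > test1.out; then
    test1=clean
else
    test1=conflict
fi

# Test 2: cherry-pick B onto A. The merge base is B's parent, A.
#   base -> ours   (A -> A) changes nothing
#   base -> theirs (A -> B) changes line 1
if git merge-file -p A.txt A.txt B.txt > test2.out; then
    test2=clean
else
    test2=conflict
fi

test2_result=$(cat test2.out)
echo "test 1: $test1"
echo "test 2: $test2, file contains: $test2_result"
```

In test 1 both sides changed the same line relative to the base, so the merge cannot choose; in test 2 only one side changed it, so that change is taken.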

When Git makes the final "merge" commit M, though, instead of linking it to both input commits (L and C), Git has it link back only to commit L:

...--P--C--...   <-- somebranch
   \
    ...--K--L--M   <-- ourbranch (HEAD)

That is, commit M is an ordinary single-parent (non-merge) commit this time. The changes in L-vs-M are the same as the changes in P-vs-C, except for any change in line offsets and file names that might be required.
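A minimal sketch of the whole picture, in a scratch repository. The branch names somebranch and ourbranch follow the diagrams; the file contents, identity settings and variable names are made up, and the "..." parts of the graph are left empty so that P is the direct parent of both C and L.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Commit P
printf 'a\nb\nc\nold f()\nd\n' > file.txt
git add file.txt
git commit -q -m P
git branch ourbranch

# Commit C on somebranch: the fix we want to pick
git checkout -q -b somebranch
printf 'a\nb\nc\nnew f()\nd\n' > file.txt
git commit -q -am C

# Commit L on ourbranch: unrelated lines inserted above f()
git checkout -q ourbranch
printf 'x\ny\nz\na\nb\nc\nold f()\nd\n' > file.txt
git commit -q -am L

# Cherry-pick the tip of somebranch (commit C), producing commit M
git cherry-pick somebranch > /dev/null

parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
picked=$(git --no-pager diff HEAD~1 HEAD | grep '^[-+][^-+]')
final=$(cat file.txt)

echo "parents of M: $parent_count"
echo "$picked"
```

M records a single parent, and its diff against L is the same one-line replacement as P-vs-C, only shifted down by the three lines that L added.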

Now, there are some caveats here. In particular, git diff doesn't identify multiple derived files from some merge base. If there are changes in P-vs-C that apply to file1.ext, but these changes need to be split into two files file2.ext and file3.ext when patching commit L, Git won't notice this. It's just a little too dumb. Also, git diff finds matching lines: it does not understand programming, and if there are spurious matches, such as lots of close braces or parentheses or whatever, that can throw off Git's diff so that it finds the wrong matching lines.

Note that Git's storage system is just fine here. It's the diff that's not smart enough. Make git diff smarter, and these kinds of operations—merge and cherry-picks—become smarter too.² For now, though, the diff operations, and hence the merges and cherry-picks, are what they are: someone and/or something should always inspect the result, by running automated tests, or looking at the files, or anything else you can think of (or a combination of all of these).

²They will need to machine-read whatever more-complex instructions come out of the diff pass. Internally, in diff, this is all in one big C program, with the diff engine acting almost like a library, but the principle is the same either way. There's a hard problem here—adapting to new diff output—and whether the format of this new diff is textual, as in separate programs that produce the diff and then apply it, or binary, as in internal library-like functions that produce change records, all you're doing here is "moving the hard around", as a colleague used to say.