I've read about the Git internals here and here and know what a commit is, as well as a tree and a blob.
I know that Git stores individual files instead of file differences (deltas), and that the latter are calculated on demand. The documentation also speaks often about the "difference between two commits" (whether they are parent and child, ancestor and descendant, or neither).
However, it's not clear to me how Git calculates those deltas in various situations (cherry-picking, merge, rebase). And which files (i.e. files from which commit) are considered in each case?
I've read that, according to that structure, a single commit can be considered a whole branch (i.e. the commit history leading up to that commit), in the sense that for a given file I can reach all of its versions by traversing the branch backwards (though not necessarily back to its root, I suppose; going back to the immediately previous file version may be enough). If my assumption is wrong, please clarify.
The rules are simple enough conceptually but get complicated in practice.
A real git merge uses the commit DAG to find the merge base(s). The merge base is defined as the Lowest Common Ancestor (generalized in the obvious way to arbitrary DAGs where there may be multiple LCAs, vs simple trees where there's always a unique LCA). The git merge-base command will, given two commits, find one (the default) or all (--all) merge base commits from the DAG.
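To make the "LCA generalized to a DAG" definition concrete, here is a minimal Python sketch. It is not Git's implementation (Git works on its object database and is far more efficient); the `parents` dictionary and the function names are assumptions made up for this illustration.

```python
from typing import Dict, List, Set

def ancestors(parents: Dict[str, List[str]], commit: str) -> Set[str]:
    """Return `commit` plus every commit reachable from it via parent links."""
    seen: Set[str] = set()
    stack = [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return seen

def merge_bases(parents: Dict[str, List[str]], a: str, b: str) -> Set[str]:
    """All lowest common ancestors of a and b (hypothetical helper)."""
    common = ancestors(parents, a) & ancestors(parents, b)
    # A common ancestor is a merge base unless it is a proper ancestor
    # of some other common ancestor.
    return {c for c in common
            if not any(c != d and c in ancestors(parents, d) for d in common)}

# Simple fork: one merge base.
fork = {"A": [], "B": ["A"], "C": ["A"]}
print(sorted(merge_bases(fork, "B", "C")))         # ['A']

# Criss-cross merge: D and E each merged both B and C, so there are two bases.
criss_cross = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C", "B"]}
print(sorted(merge_bases(criss_cross, "D", "E")))  # ['B', 'C']
```

The criss-cross case is the situation in which there is more than one merge base to choose from.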
If there are multiple merge bases, the algorithm depends on the -s (strategy) argument. The default recursive strategy merges the merge bases using recursion (what else? :-) ). This is currently done the slow-simple-stupid way: if there are 5 merge bases, Git merges two of them (finding the merge base of those two as needed) and makes a "virtual commit" from the result, merges that result with the next (3rd) in the list-of-5, merges that result with the 4th, and merges that with the 5th to get the final virtual merge base. (To make this all work correctly, I believe Git actually makes real commits. There's no reason not to: these unreferenced commits will be garbage-collected automatically later.)
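The left-to-right folding described above can be sketched as a plain reduction. This is only an illustration of the order in which the bases get combined: `merge` here is a hypothetical stand-in for "merge two commits and return the resulting virtual commit", not a Git API.

```python
from functools import reduce
from typing import Callable, List

def virtual_merge_base(bases: List[str], merge: Callable[[str, str], str]) -> str:
    """Fold a list of merge bases, left to right, into one virtual base."""
    if not bases:
        raise ValueError("need at least one merge base")
    return reduce(merge, bases)

# Toy merge function (an assumption): it just records which inputs were combined.
def toy_merge(a: str, b: str) -> str:
    return f"({a}+{b})"

print(virtual_merge_base(["b1", "b2", "b3", "b4", "b5"], toy_merge))
# ((((b1+b2)+b3)+b4)+b5)
```

With a single merge base the reduction returns it unchanged, which matches the ordinary case where no virtual commit is needed.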
The resolve strategy simply picks one of the multiple merge bases and uses that as the base.
In any case, the two diffs that get combined, once we have a single merge base hash ID $base and the two branch-tips, are the output from:
git diff $base $tip1
git diff $base $tip2
(more or less: there's some tweaking of the --rename-limit value if needed, depending on extra merge command arguments, and all this assumes no special merge drivers; the actual merging happens file-by-file, but the merge base version for each file comes from $base, with any rename detection happening first from the two commit-wide diffs).
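As a rough picture of what "combining the two diffs" means, here is a deliberately simplified Python sketch that works at whole-file granularity. Real Git merges within each file line by line and handles renames and merge drivers, none of which is modeled here; the `Tree` type and `three_way_merge` are names invented for this example.

```python
from typing import Dict, Set, Tuple

Tree = Dict[str, str]  # path -> file content (a stand-in for a blob)

def three_way_merge(base: Tree, tip1: Tree, tip2: Tree) -> Tuple[Tree, Set[str]]:
    """Combine base->tip1 and base->tip2 changes, one whole file at a time."""
    merged: Tree = {}
    conflicts: Set[str] = set()
    for path in sorted(set(base) | set(tip1) | set(tip2)):
        b, o, t = base.get(path), tip1.get(path), tip2.get(path)
        if o == t:        # both sides agree (this includes both deleting the file)
            result = o
        elif o == b:      # only tip2 changed this file
            result = t
        elif t == b:      # only tip1 changed this file
            result = o
        else:             # both changed it, differently
            conflicts.add(path)
            continue
        if result is not None:  # None means the file is absent in the result
            merged[path] = result
    return merged, conflicts

base = {"a": "1", "b": "1", "c": "1"}
tip1 = {"a": "2", "b": "1", "c": "2"}
tip2 = {"a": "1", "b": "3", "c": "3", "d": "new"}
print(three_way_merge(base, tip1, tip2))
# ({'a': '2', 'b': '3', 'd': 'new'}, {'c'})
```

The point is only that each side's change is measured against the same base, and a conflict arises when both sides changed the same thing in different ways.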
The git cherry-pick command diffs each commit against its parent, and then first tries to apply the resulting delta as a patch. If that fails, it falls back on a three-way merge, but the merge base is chosen on a file-by-file basis rather than a commit-by-commit basis, because it uses the index information in the formatted patch. There's one index line per file in the patch, giving the SHA-1 IDs of the two blobs in question.
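To show where those per-file blob IDs live, here is a small Python sketch that pulls them out of a patch. It assumes Git's usual diff header layout, a `diff --git a/<path> b/<path>` line followed by an `index <old>..<new> <mode>` line, and paths without spaces; `blob_ids` is a made-up helper, not part of Git.

```python
import re
from typing import Dict, Optional, Tuple

DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")
INDEX_LINE = re.compile(r"^index ([0-9a-f]+)\.\.([0-9a-f]+)(?: \d+)?$")

def blob_ids(patch: str) -> Dict[str, Tuple[str, str]]:
    """Map each path in the patch to its (old blob ID, new blob ID)."""
    result: Dict[str, Tuple[str, str]] = {}
    path: Optional[str] = None
    for line in patch.splitlines():
        header = DIFF_HEADER.match(line)
        if header:
            path = header.group(2)  # remember which file the next index line belongs to
            continue
        index = INDEX_LINE.match(line)
        if index and path is not None:
            result[path] = (index.group(1), index.group(2))
    return result

patch = """\
diff --git a/foo.txt b/foo.txt
index 83db48f..bf269f4 100644
--- a/foo.txt
+++ b/foo.txt
@@ -1 +1 @@
-old
+new
"""
print(blob_ids(patch))  # {'foo.txt': ('83db48f', 'bf269f4')}
```

The first ID of each pair names the per-file base that a three-way fallback would need to find in the repository.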
Thus, the merge base is initially ignored entirely: the cherry-pick just uses the patch as a patch. Only if the patch does not apply (as in git apply) does the cherry-pick fall back to a three-way merge (as in git apply -3). The blob itself must also exist in your repository; for a cherry-pick, it always does, while for a literal git apply of an emailed patch, it may not.
At this point the two diffs to be combined are:
git diff $indexbase $file1
the diff in the patch # equivalent to git diff $indexbase $file2
where $indexbase is the file extracted by the hash ID in the index line and $file1 is the file in your work-tree. (This file matches the HEAD commit unless you're using git cherry-pick -n.) In an arbitrary (emailed) patch you don't necessarily have $file2 at all, just the diff; in a cherry-picked patch, $file2 is the version of the file in the commit being cherry-picked (but it's not needed since we already have the diff!).
If you cherry-pick a merge commit, you must tell Git which parent of that merge commit is to be used to produce a changeset-as-patch (with the -m <parent-number> option). This step is completely manual.
A rebase consists, functionally, of a series of cherry-pick operations. Merge commits are omitted from rebases. (Interactive rebase's --preserve-merges operation makes new merges, completely ignoring the original merge.) An interactive rebase literally runs git cherry-pick (one at a time for each commit to be copied), while a non-interactive rebase attempts to use git format-patch <args> | git am -3 if it can (format-patch elides "empty" commits so this is only possible without -k).
The commits to be copied are chosen via an actual git rev-list --cherry-pick on a symmetric difference in some cases, or, for algorithmic purposes, something equivalent.
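The selection step can be pictured with sets. The Python sketch below is an assumption-laden model, not what Git runs: each side is given as the set of commits reachable from it, and `patch_id` is a hypothetical mapping that plays the role of git patch-id, so that two commits introducing the same change share one ID. A real rebase also keeps the commits in topological order, which a plain set does not.

```python
from typing import Dict, Set

def commits_to_copy(upstream: Set[str], branch: Set[str],
                    patch_id: Dict[str, str]) -> Set[str]:
    """Commits only on `branch` whose change is not already in `upstream`."""
    only_upstream = upstream - branch   # one half of the symmetric difference
    only_branch = branch - upstream     # the other half: candidates for copying
    already_applied = {patch_id[c] for c in only_upstream}
    return {c for c in only_branch if patch_id[c] not in already_applied}

# C2 on upstream is an earlier copy of C, so C is skipped and only D is copied.
upstream = {"A", "B", "C2"}
branch = {"A", "C", "D"}
patch_id = {"A": "p-a", "B": "p-b", "C": "p-c", "C2": "p-c", "D": "p-d"}
print(commits_to_copy(upstream, branch, patch_id))  # {'D'}
```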