git github version-control pull tree-conflict

How does git pull manage commit history?

Let's say I clone a remote repository, and so far it has 1 commit => A. Then I make two commits on my local branch, so it becomes => A - B - C. Meanwhile, however, my coworker made two other commits on their local branch, so their commit history becomes => A - D - E. Then they push it to the remote repository.

Then I realize I want to push my changes, but git push tells me that the remote repository is ahead of me. So, I do git pull.

My question is, what does my local branch, which tracks the remote-tracking branch, look like now? I understand that there will be merge conflicts, but my actual question is: what will the commit history look like?

To be more specific, say I fixed the conflicts and committed the result. Would my commit history look like A - D - E - F or A - B - C - D - E - F? Is commit history in Git non-linear?

Solution

The shortest answer (not quite 100% accurate, but very close) is that git pull doesn't manage the history at all. What git pull does for you is run two Git commands that I recommend you, as a beginner, run yourself, separately:

First, git pull performs a git fetch. This command is pretty simple and straightforward (but with some twists). It obtains new commits from some other repository: your Git calls up some other Git, your Git and their Git exchange commit hash IDs, and from this, your Git discovers what commits (and associated files) you need to get from them, so that you'll have all of their commits with a reasonably-minimal amount of data brought over the Internet.
Once that's complete, git pull runs a second Git command. This is where most of the complexity lies. (These second commands tend to have a lot of options, modes, and features, so it's almost as if it runs one of a dozen commands.)

The choice of second Git command is yours, but when you use git pull, you're forced to make that choice before you have a chance to see what git fetch will do. I think this is Bad (capital B bad, but not bold or italic bad, so only moderately bad 😀). Once you've used Git a lot, and know how fetch works, and perhaps more important, have discovered how certain colleagues or co-workers or friends use Git—these all affect what git fetch will do—it can be safe to decide how to integrate fetched commits before fetching them. But early on, I think it's a bit too much to ask.¹

¹It's always possible to undo the things that the second command does, but you need to know all about that second command. As a beginner, you might not even realize that there are two different commands here. You certainly won't know enough to be able to undo each effect of each mode of each command.

You have the right setup after `git fetch`

Let's say I clone a remote repository, and so far it has 1 commit => A. Then I make two commits on my local branch, so it becomes => A - B - C. Meanwhile, however, my coworker made two other commits on their local branch, so their commit history becomes => A - D - E. Then they push [this] to [a shared remote] repository.

When they beat you to the punch and their git push to the shared (third) repository "wins", the commits in that shared third repository now have the A-D-E form:

A--D--E   <-- main

(The branch name here isn't all that important, but I'm using main since GitHub now uses that as its default, and you mentioned github in the tags.)

What the git fetch step gets you is commits D and E. You already have commit A, and no commit can ever be changed after it's made.² So you just need D-E, which wind up in your repository like this:

  B--C   <-- main
 /
A
 \
  D--E   <-- origin/main

The name origin/main is your Git's remote-tracking name, which your Git creates from their Git's branch name main. Your Git takes each of their Git's branch names and changes them, to make these remote-tracking names. Since the remote-tracking names aren't branch names, any changes that git fetch makes to them—to handle whatever happened in the other Git repository—won't affect any of your branches. Hence it's always safe to run git fetch.³

I drew commit A on its own line to emphasize how it's just the one commit, shared by both lines-of-development. And—something to think about—if a branch is a line of development, then isn't origin/main a branch, sort of? That's a fuzzy definition of "branch",⁴ but it turns out to be useful in a moment.
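Here is a minimal sketch of what the fetch step does to names, using a toy model rather than Git's real storage. The dictionaries and the toy_fetch function are my own illustration, not anything Git provides: commits are just letters mapped to their parent lists, and names are mapped to the commit they select.

```python
# Toy model, not Git's real storage: each commit name maps to its parent list.
my_commits = {"A": [], "B": ["A"], "C": ["B"]}
my_refs = {"refs/heads/main": "C", "refs/remotes/origin/main": "A"}

their_commits = {"A": [], "D": ["A"], "E": ["D"]}
their_refs = {"refs/heads/main": "E"}


def toy_fetch(my_commits, my_refs, their_commits, their_refs, remote="origin"):
    """Copy commits we lack, then update only our remote-tracking names."""
    for commit, parents in their_commits.items():
        # Existing commits are never altered; only missing ones are added.
        my_commits.setdefault(commit, list(parents))
    prefix = "refs/heads/"
    for name, tip in their_refs.items():
        if name.startswith(prefix):
            branch = name[len(prefix):]
            # Their branch name becomes our remote-tracking name.
            my_refs[f"refs/remotes/{remote}/{branch}"] = tip


toy_fetch(my_commits, my_refs, their_commits, their_refs)
print(sorted(my_commits))
print(my_refs["refs/heads/main"], my_refs["refs/remotes/origin/main"])
```

Note that refs/heads/main still selects C afterwards: only the remote-tracking name moved, which is why the fetch step by itself never disturbs your own branches.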

²Note that git commit --amend, for instance, does not actually change a commit. Instead, it makes a new commit, and has you use that instead of the other commit that you were using. You now have two almost-identical commits, with one just sort of shoved aside and ignored.

³You can set up git fetch, or give it arguments, that make it do "unsafe" things, but it's pretty hard. The usual easy way is to make a mirror clone, but a mirror clone is automatically --bare too, and a bare clone won't let you do any work in it. (Mirror clones are just for special situations, not for ordinary everyday work.)

⁴Git's definition of branch is deliberately weak and fuzzy, and it can be helpful to be careful to say branch name instead. Branch names are well-defined and don't suffer from this sort of philosophical ambiguity. A remote-tracking name is clearly different from a branch name, although both kinds of names let Git find commits, and the commits themselves form what we (humans) like to think of as "branches". So in that sense, origin/main is a name that finds a branch. It's just not a branch name: internally, it's spelled refs/remotes/origin/main, where a branch name has to start with refs/heads/. The branch name main is spelled refs/heads/main internally. See also What exactly do we mean by "branch"?

The second command: your choice of `git merge` or `git rebase`

The second command that git pull runs is where most of the real action happens. This is either git merge, or git rebase.⁵ These deal with the divergence you set up with your git fetch. Each one uses a different method.

Merging is fundamentally simpler than rebasing. This is because rebase works by copying commits, as if by running git cherry-pick—some forms of git rebase literally use git cherry-pick and others use an approximation—and each cherry-pick is itself a kind of merge. This means that when you rebase three commits, you're getting three merges performed, for instance. The copying that rebase performs is followed by one more internal Git operation, while many forms of git merge are one-step-and-done.

⁵Technically, git pull can run git checkout in one special case, but that case does not apply here.

Merging

Merging is, fundamentally, about combining work.

Note that we have to combine work when we have a situation like the one we drew above, where some common starting point (commit A) is followed by diverging work. There are, however, cases where "combining work" is trivial:

A   <-- theirs
 \
  B--C   <-- ours

Here, "they"—whoever they are—didn't actually do any work, so to "combine" your work with theirs, you can just have Git switch to your latest commit:

A--B--C   <-- (combined successfully)

Git calls this kind of "combining" a fast-forward operation, and when git merge does it, Git calls this a fast-forward merge. In general, if git merge can do a fast-forward merge, it will do one. If not, it will do a full-blown merge.
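The test for "can this be a fast-forward?" can be sketched on a toy commit graph. This is only a model under my own assumptions (letters for commits, a dictionary of parent links, and the helper names ancestors and can_fast_forward are all made up for illustration): a fast-forward is possible when the commit you are on is already an ancestor of the commit you want to merge.

```python
def ancestors(tip, parents):
    """Return the tip itself plus every commit reachable through parent links."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def can_fast_forward(current_tip, other_tip, parents):
    """True when current_tip is already part of other_tip's history."""
    return current_tip in ancestors(other_tip, parents)


# The trivial case drawn above: "theirs" is still at A, "ours" is at C.
trivial = {"A": [], "B": ["A"], "C": ["B"]}
# The diverged case from the question: main at C, origin/main at E.
diverged = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"], "E": ["D"]}

print(can_fast_forward("A", "C", trivial))   # A is an ancestor of C
print(can_fast_forward("C", "E", diverged))  # C is not an ancestor of E
```

In the diverged case neither tip is an ancestor of the other, so a real merge is needed.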

A full merge finds a merge base (a shared commit that's on both branches, using the deliberately loose definition of branch I mentioned earlier; here that's commit A) and compares the snapshot in that particular commit to the snapshot in each of the two branch-tip commits. This allows Git to figure out "what we changed" and also "what they changed":

  B--C   <-- main
 /
A
 \
  D--E   <-- origin/main

The diff from A to C shows what we changed in our two commits. The diff from A to E shows what they changed in their two commits.
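Finding the merge base can be sketched on the same kind of toy graph. This is an illustration under my own assumptions, not Git's actual algorithm: the function names are invented, and the "best" shared commit is taken to be one that is not an ancestor of any other shared commit.

```python
def ancestors(tip, parents):
    """Return the tip itself plus every commit reachable through parent links."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def merge_bases(tip_a, tip_b, parents):
    """Shared commits that are not ancestors of some other shared commit."""
    common = ancestors(tip_a, parents) & ancestors(tip_b, parents)
    return sorted(
        c for c in common
        if not any(c != other and c in ancestors(other, parents) for other in common)
    )


# main is at C, origin/main is at E, and both grew from A.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"], "E": ["D"]}
print(merge_bases("C", "E", parents))
```

For the graph in the question this yields the single commit A, which is the starting point for both diffs.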

Git then attempts to combine and apply both sets of changes to the snapshot in commit A. If Git thinks that this went well, Git will go ahead and make a new snapshot—a new commit—from the result. By taking our changes and adding theirs (or, equivalently, taking their changes and adding ours), Git's merge commit will have, as its snapshot, the ?correct? combination. The question marks here are because Git is just using simple line-by-line rules. The result might not be correct in some other sense: it's just correct-by-Git's-rules.
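The combining rule can be sketched as a three-way merge. To keep it short, this toy version works on whole files rather than line by line, which is a big simplification of what Git really does; the file names, contents, and the three_way_merge function are all hypothetical.

```python
def three_way_merge(base, ours, theirs):
    """Combine two sets of changes relative to a common base, file by file."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o            # both sides agree (or neither touched it)
        elif o == b:
            result = t            # only they changed this file
        elif t == b:
            result = o            # only we changed this file
        else:
            conflicts.append(path)  # both changed it, differently
            continue
        if result is not None:      # None means the file was deleted
            merged[path] = result
    return merged, conflicts


snapshot_a = {"README": "v1", "app.py": "v1", "util.py": "v1"}
snapshot_c = {"README": "ours", "app.py": "ours", "util.py": "v1"}      # our tip
snapshot_e = {"README": "v1", "app.py": "theirs", "util.py": "theirs"}  # their tip

merged, conflicts = three_way_merge(snapshot_a, snapshot_c, snapshot_e)
print(merged)
print(conflicts)
```

Files changed on only one side are taken automatically; a file changed differently on both sides is reported as a conflict, which is the part you have to resolve by hand before the merge commit can be made.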

In any case, the new merge commit that Git will make now links back to both our current commit C and their commit E:

  B--C
 /    \
A      F   <-- main
 \    /
  D--E   <-- origin/main

Our branch name, main, now selects the new merge commit F. Note that F has a snapshot, like any ordinary commit, and a log message and author and so on, like any ordinary commit. The only thing special about F is that instead of pointing back to one previous commit, it points back to two.

This has huge consequences, though, because the way Git finds commits is to start from some name—often a branch name, though any kind of name will do—and use that to locate the last commit, but then follow all the backwards links to all the previous commits. So from F, Git goes backwards to both C and E "at the same time".⁶

⁶Since this isn't quite possible, Git has to use some sort of approximation. Some parts of Git use breadth-first search algorithms, and others use various tricks.
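To tie this back to the question, here is a small sketch of walking backwards from the merge commit F. It is a toy model under my own assumptions (letters for commits, invented function names, and a plain breadth-first order, whereas real Git output is normally ordered using commit dates), but it shows that the history is neither A - D - E - F nor A - B - C - D - E - F: it is a graph, and every commit is reachable from F.

```python
from collections import deque

# F is the merge commit: its first parent is C (ours), its second is E (theirs).
parents = {
    "A": [], "B": ["A"], "C": ["B"],
    "D": ["A"], "E": ["D"],
    "F": ["C", "E"],
}


def walk(tip, parents):
    """Breadth-first walk over parent links, visiting each commit once."""
    order, seen, queue = [], {tip}, deque([tip])
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in parents[commit]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


def first_parent_chain(tip, parents):
    """Follow only the first parent of each commit, ignoring merged-in sides."""
    chain = [tip]
    while parents[chain[-1]]:
        chain.append(parents[chain[-1]][0])
    return chain


print(walk("F", parents))
print(first_parent_chain("F", parents))
```

The full walk visits all six commits, while following only first parents gives F, C, B, A, which is the "our side only" view of the same history.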

Rebasing

Rebasing is, fundamentally, about taking some commits that are "okay-ish, but not good enough" and copying them to new-and-improved commits that are (supposedly) better, then abandoning the originals in favor of the new-and-improved copies.

There are a couple of problems with doing this:

Git "likes" adding new commits. It "doesn't like" tossing out old commits. Rebase forces Git to toss the old ones in favor of the new-and-improved ones, which is fine as far as it goes, but ...
We send commits from one Git repository to another. Once they've been copied—once the horses are out of the barn and cloned—it does no good to destroy some of them. If we have new-and-improved replacements, we have to have every Git that has copies of the originals pick up and switch to the new-and-improved replacements. This means we need to force other Gits to give up some existing commit(s).

A simple rule that always works is: Only replace commits you never gave out. This works because if you have the only copy, your new-and-improved replacements don't require getting any other Git to throw out the old ones. There is no other Git repository involved! But it's too simple, at least with many GitHub work-flows.

A more complicated way to deal with this is: Only replace commits that you and all other users of these repositories have agreed, in advance, can be replaced. The other users will—if they're paying attention, at least—notice the replacements and pick them up.

Without getting into all the details, what git rebase does is:

list out the commits to copy (the hash IDs);
use Git's detached HEAD mode to avoid the need for a temporary branch;
check out the target commit, the one the copies are to come after;
copy the to-be-copied commits, one by one, using git cherry-pick or some equivalent; and last
move the branch name to point to the last copied commit.
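The five steps above can be sketched on a toy graph. This models only the commit links and names, not the snapshots that a real cherry-pick computes, and it assumes the commits being copied form a simple chain with no merges and that the two names share some history; toy_rebase and the dictionaries are my own illustration.

```python
def ancestors(tip, parents):
    """Return the tip itself plus every commit reachable through parent links."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def toy_rebase(branch, upstream, parents, refs):
    """Copy the commits unique to `branch` so they come after `upstream`."""
    upstream_history = ancestors(refs[upstream], parents)

    # Step 1: list the commits to copy, oldest first.
    to_copy = []
    commit = refs[branch]
    while commit not in upstream_history:
        to_copy.append(commit)
        commit = parents[commit][0]
    to_copy.reverse()

    # Steps 2 and 3: a detached HEAD starts at the target commit.
    head = refs[upstream]

    # Step 4: copy one commit at a time; each copy gets a new name and parent.
    for original in to_copy:
        copy = original + "'"
        parents[copy] = [head]
        head = copy

    # Step 5: move the branch name to the last copy.
    refs[branch] = head
    return to_copy


parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"], "E": ["D"]}
refs = {"main": "C", "origin/main": "E"}

copied = toy_rebase("main", "origin/main", parents, refs)
print(copied, refs["main"], parents["B'"], parents["C'"])
print("B" in parents and "C" in parents)  # the originals are still stored
```

The originals are never modified or removed by the rebase itself; the name main simply stops selecting them.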

In this case, you could rebase (copy) your two existing commits to two new-and-improved commits:

  B--C   <-- main
 /
A      B'-C'  <-- HEAD
 \    /
  D--E   <-- origin/main

where B' and C' are the copies of B and C. The snapshot in B' is built by making changes to the snapshot in E; the changes to be made are those seen by comparing A and B. The snapshot in C' is similar, but is made by taking the changes from B to C.

Once the copies are all done, Git peels the old main label off the old C commit and pastes it onto the new C' commit:

  B--C   [abandoned]
 /
A      B'-C'  <-- main (HEAD)
 \    /
  D--E   <-- origin/main

The original B and C commits still exist for some time, but without an easy way to find them, you just don't see them any more. If you did not carefully note down the real hash IDs of the original B and C, you would think that their new-and-improved replacements somehow magically changed B and C in place. But they didn't: they're entirely new, and the old commits still exist. The old commits are simply not used. After some time (at least 30 days by default) Git will consider them trash and, eventually, "garbage collect" them with git gc (which Git runs automatically for you, via git gc --auto spun off from various Git commands, without you having to do anything).

If all goes well, the rebased commits "preserve the essence" of your work, making it look as though you started working after you saw what your colleague was going to do. The date-and-time stamps inside the copied commits are more complicated though:

The author date is when you originally wrote the commits, preserved.
The committer date is when you last used rebase to copy them.

You can repeatedly rebase commits, and the author timestamp persists in each copy. To see both timestamps, use git log --pretty=fuller, for instance.
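The two timestamps can be modelled with a small sketch. The ToyCommit class, the copy_for_rebase function, and the dates are all hypothetical; the point is only that each copy keeps the author date and receives a fresh committer date.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class ToyCommit:
    message: str
    author_date: datetime
    committer_date: datetime


def copy_for_rebase(original, when):
    """Make a new commit object: same author date, new committer date."""
    return replace(original, committer_date=when)


written = datetime(2021, 3, 1, 9, 0)          # when the work was first committed
b = ToyCommit("B", author_date=written, committer_date=written)

b1 = copy_for_rebase(b, datetime(2021, 3, 2, 14, 30))   # first rebase
b2 = copy_for_rebase(b1, datetime(2021, 3, 5, 11, 15))  # second rebase

print(b2.author_date, b2.committer_date)
```

The original object b is untouched, which mirrors how the original commit keeps its own dates while each copy records when it was made.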
