Confused by git diff

I want to generate a patch file of the differences between my branch and master. But the branch is quite long-lived so I just did a merge from master to bring it up to date. I can see the differences fine if I start creating a pull request in Bitbucket. But when I do git diff master.. on my branch I see differences shown that aren't there. Are they resulting from the merge? How can I get a list of differences the same as Bitbucket - just the differences between my branch and master right now?

Solution

TL;DR

It's not clear to me quite where your confusion starts, but it's worth noting that using git diff is quite different from generating a pull request. Eventually, it will boil down to running git diff on the correct specific commits. The trick lies in finding the right commits.

Long

What's in a repository

First, remember what it is that Git keeps. At a fundamental level, what Git cares about are source snapshots, saved in the form of commits. A commit contains a complete snapshot of some source tree. A commit also contains some metadata: the name and email address of the person, or sometimes two people, who made the commit (author and committer: they may be the same, or separate) and time-stamps for when they made it; a parent commit ID, so that Git can present the series of commits as a history of who (author) did what (see below), and when (timestamp); and a log message, to provide the author's description of why they did what they did.

Since each commit is a full snapshot, in order to see who actually did what, we must use a command like git diff. Suppose we have two commits done in succession, on branch master, like this:

(parent)   (child)
df731ac <- 049a12b   <-- master

A branch name like master lets us find the most recent commit 049a12b. We use the child's stored parent ID df731ac to find the parent, and then we can run git diff df731ac 049a12b—or much more simply, git show master—to compare df731ac to 049a12b.

Whatever comes up as different here, the author of 049a12b must have changed it. But df731ac (the predecessor or parent commit) is a complete snapshot, and 049a12b (the successor or child commit that is the tip of branch master) is also a complete snapshot. Knowing this is helpful for understanding the next part.
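As a minimal sketch of that comparison, assuming a throwaway repository in a temporary directory (the file names, commit messages, and the Example identity are all made up for illustration):

```shell
set -e
# Throwaway identity so commits work even with no Git config (assumption).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)
cd "$work"
git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master   # use master whatever init.defaultBranch says

echo one > a.txt
git add a.txt && git commit -q -m "parent"
echo two > b.txt
git add b.txt && git commit -q -m "child"

parent=$(git rev-parse "master^")   # the parent ID stored in the tip commit
child=$(git rev-parse master)       # the tip commit of master
changed=$(git diff --name-only "$parent" "$child")
echo "changed between parent and child: $changed"

# Same comparison, with the metadata printed first.
git show master
```

Both commits hold every file, but only b.txt shows up in the comparison, because that is the only thing that differs between the two snapshots.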

Note that, as in the drawing above, a branch name like master or develop or feature/tall simply contains the ID of one specific commit. We call this commit the tip commit of the branch. When you add new commits to a branch, what Git does is create the new commit, which gives it an ID, and then write the new tip commit ID into the branch name. The branch names therefore "move" over time: they always point to the latest (child-most) commit. Each new commit has, as its parent, the ID that was the tip of the branch before, which lets Git follow these backwards pointers through the repository.

If Git commit hash IDs were just one letter, we could draw a simple three-commit repository as:

A <-B <-C   <-- master

and adding a new commit would simply consist of writing commit D with C as its parent, and making master point to D:

A--B--C--D   <-- master

The special name HEAD normally contains the name of a branch. So if HEAD contains master, Git can use HEAD to select branch master, and master to find D. In other words, Git typically starts by using a branch name to get a tip commit ID. Then it looks at that commit to get its parent ID, then looks at the parent commit for another parent, and so forth. This is what branch names are for, and what they do: they find tip commits.
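You can watch a branch name move in a scratch repository. This is only a sketch under made-up assumptions (temporary directory, file name, and the Example identity are mine):

```shell
set -e
# Throwaway identity so commits work even with no Git config (assumption).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)
cd "$work"
git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master   # use master whatever init.defaultBranch says

echo c > file.txt
git add file.txt && git commit -q -m "C"
old_tip=$(git rev-parse master)           # ID currently stored under the name master

echo d >> file.txt
git commit -q -am "D"
new_tip=$(git rev-parse master)           # master now holds the new commit's ID
parent_of_new=$(git rev-parse "master^")  # D's parent is the previous tip
head_branch=$(git symbolic-ref --short HEAD)   # HEAD holds a branch name

echo "HEAD -> $head_branch"
echo "old tip: $old_tip"
echo "new tip: $new_tip (parent: $parent_of_new)"
```

The old tip is still in the repository; it is just reached through the new commit's parent ID rather than directly through the branch name.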

Using `git diff`

All git diff does (most of the time anyway¹) is to take any two individual commits like this and compare them. To do this it needs to resolve its two inputs to hash IDs. Those hash IDs are the two commits; it then compares the two snapshots.

When you run git diff master.., Git's diff translates master.. into master and HEAD (the default to fill in for an empty position around .. is HEAD), and then translates master into a branch tip ID. If the tip commit of branch master is 049a12b as in the drawing above, the hash ID for the left half of the comparison will be 049a12b. For the right half, git diff must read HEAD to get its branch name, such as develop or feature/tall or whatever. That branch name then maps to its own tip commit. Let's say its abbreviated ID is 6bc9702. Then this git diff command ultimately tells Git to extract the source snapshot for 049a12b, the one for 6bc9702, and compare those two.

You can, however, supply any two hashes for any two commits that you have:

git diff 0123456 fedcba9

for instance. But you have to find those commits, or some name that Git will turn into those commits.

(It doesn't matter if you say git diff A B or git diff A..B; these mean exactly the same thing. This is different from git log and most other Git commands: only git diff has this special handling for the two-dot .. syntax. However, the rule that fills in HEAD if one of the names is missing is common to git diff and other Git commands.)
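To check that these spellings really are equivalent, here is a small sketch in a scratch repository (the temporary directory, file contents, and Example identity are assumptions for the demo; feature/tall is just the sample branch name used above):

```shell
set -e
# Throwaway identity so commits work even with no Git config (assumption).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)
cd "$work"
git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master   # use master whatever init.defaultBranch says

echo base > file.txt
git add file.txt && git commit -q -m "base on master"
git checkout -q -b feature/tall
echo feature >> file.txt
git commit -q -am "work on feature/tall"

# HEAD is feature/tall here, so the empty side of "master.." becomes HEAD.
d1=$(git diff master..)
d2=$(git diff master HEAD)
d3=$(git diff master..feature/tall)
d4=$(git diff master feature/tall)

if [ "$d1" = "$d2" ] && [ "$d2" = "$d3" ] && [ "$d3" = "$d4" ]; then
  echo "all four forms produce the same diff"
fi
```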

¹Git's git diff can produce something called a combined diff but these are rather complicated, and not relevant here.

Brief aside: `git show` and `git log -p`

I mentioned git show above. What git show does is to find the parent commit automatically for you, and then show you first the metadata—the author (name, email, timestamp) and the log message—and then a diff from parent to child.

When you run git log -p, this is similar to running git show on each commit, starting from the child-most and working backwards (note that git log defaults to starting from HEAD). That is, first git log shows you the current branch's tip commit as if by git show HEAD, then it shows you that commit's parent as if by git show, then it shows you the parent's parent as if by git show, and so on.

There is one fairly big difference: git show will invoke the special combined diff machinery on any merge commits, while git log will just show the log message by default, skipping any attempt at diffing the merge. (There are flags you can use to change this behavior.)
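A quick way to see the backwards walk is a scratch repository with a few linear commits. This is a sketch only; the directory, file name, messages, and Example identity are invented for the demo, and it does not cover the merge case:

```shell
set -e
# Throwaway identity so commits work even with no Git config (assumption).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)
cd "$work"
git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master   # use master whatever init.defaultBranch says

for n in 1 2 3; do
  echo "$n" >> file.txt
  git add file.txt
  git commit -q -m "commit $n"
done

# git log starts at HEAD and follows parent IDs backwards.
order=$(git log --format=%s)
printf '%s\n' "$order"

# With -p, each entry also gets a parent-to-child diff, much as git show would print.
git log -p --format='--- %h %s'
```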

Pull requests

Pull requests are more complicated, because in order to make a pull request, you must either open your repository to someone else who can run git pull²—this is where the term comes from, and is the original meaning of pull request—or else find or create a shared repository, push some of your commits to this shared location, and then ask the other person to obtain your commits from the shared location. I'll ignore the original meaning of "pull request"—essentially just an email message asking someone else to run git fetch—and jump into the way these sites handle it instead.

With services like GitHub and Bitbucket, there are now at least two other repositories involved. They even run a trial merge (though this is not so important, other than to verify that the pull request makes sense). I'm more familiar with GitHub than Bitbucket (I use GitHub myself), but both work the same way here, at least from a sufficiently high level view.

Before you can even think about pull requests, you must "fork" a repository. A fork is a clone, but with some extra memory about which repository it was cloned from.³ Behind the scenes, in a way that you normally don't have to care about,⁴ the provider does a lot of storage-sharing so that each fork takes very little space on the provider's servers.

This forking, though, is why there are two extra repositories involved. This gives us three repositories we must keep track of:

Your own Git repository, on your machine. This is yours, to do with as you will. This is where you run git diff, too.
Your fork of the original repository. Your machine, and your Git, will refer to this as the origin remote.
The original repository. This does not necessarily have any name in your repository. You can—and perhaps should—add another remote, which in other examples is called upstream. It's not always required that you add this, but let's assume you did. If you have not, run:
```
git remote add upstream <url>
```
where <url> is the URL of the repository you forked your origin repository from.

We'll refer, below, to your repository, your origin, and your upstream. Remember that these remote names are actually just short names in your own repository referring to another Git at some URL. That's what a remote is: a short name for a URL where there is a Git repository at that URL. We'll use the word provider to mean GitHub or Bitbucket.
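To see that a remote is nothing more than a stored name-to-URL mapping, here is a sketch that uses a local bare repository as a stand-in for the real upstream (the paths are hypothetical; a real setup would use the provider's URL):

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q --bare original.git        # stands in for the repository you forked from
git init -q mine && cd mine

url="$work/original.git"               # hypothetical URL; a local path works as one
git remote add upstream "$url"
stored=$(git remote get-url upstream)  # the URL Git recorded under the short name
echo "upstream -> $stored"
git remote -v
```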

²The git pull command is meant as a short-cut for doing git fetch followed by a second Git command, all with one command. As it turns out, it's often important to use the two commands separately—not always, but often enough that combining them like this was probably a mistake. Probably, the command now named git fetch should have been named git pull, and the one now named git pull could be options you pass to git fetch, or a pair of convenience shortcut commands: git fm for fetch-and-merge, and git fr for fetch-and-rebase. I recommend that new Git users avoid git pull in favor of the separate commands, at least until they are quite familiar with the separate commands. Nonetheless, this slight historical error is fully baked into Git today, not only in terms of git pull being the obvious (but incorrect) opposite of git push, but also in the very name "pull request".

³This is over and above—or maybe "beside" is a better description—the way that clones remember their origin through the remote name origin. In any case forks are more like mirror clones initially, but are not slaved to the repository from which they are forked like mirror clones would be. So they're kind of a hybrid, with extra features—including, specifically, that you can make the service's version of a pull request.

⁴GitHub occasionally brings this up if and when you delete forks vs deleting unforked repositories, since (a) they have to undo the special fork sharing, and (b) deleting forks is safer in that the original (from which you forked) repository is still around. I imagine Bitbucket is similar.

A provider pull request starts with `git push`

The main thing to know about git push is that it pushes commits, not files. It does this by calling up some other Git repository. Then it finds out what commits you have that they don't, gives them your commits, and asks them to set some name(s), usually branch names, to remember specific commits.

Now, your fork at origin belongs to you, so you can git push to it however and whenever you like. It's a real, actual Git repository (or something that acts just like one), stored on the provider's machines rather than your own, but it's just like your own Git repository in that it has commits, and branch names, and those branch names point to tip commits that point back to previous commits.

When you run git push, your request to set a branch name, like master or develop or feature/tall, comes with a commit hash ID. If their Git doesn't have that commit, your Git gives their Git that commit. If their Git doesn't have that commit's parent, your Git gives their Git the parent, too. This continues on until you reach some commit their Git does have. Those are what you both shared before you started the git push.

The commit hash ID you give them is normally the one at the tip of your branch. So if you have:

 ...--H--I--J   <-- master

and you git push origin master, you are getting your Git to call up their Git and say "I'd like you to set your master to commit J". If their Git has their master pointing to commit H, and is missing I and J, your Git gives them I and J, too.
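Here is a rough sketch of that exchange, with a local bare repository playing the part of your fork (all paths, file contents, and the Example identity are assumptions for the demo):

```shell
set -e
# Throwaway identity so commits work even with no Git config (assumption).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)
cd "$work"
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/master   # use master whatever init.defaultBranch says
echo h > file.txt
git add file.txt && git commit -q -m "H"

cd "$work"
git clone -q --bare seed origin.git       # stands in for your fork at the provider
git clone -q origin.git mine && cd mine

echo i >> file.txt && git commit -q -am "I"
echo j >> file.txt && git commit -q -am "J"
git push -q origin master                 # send I and J, ask them to set master to J

local_tip=$(git rev-parse master)
their_tip=$(git --git-dir="$work/origin.git" rev-parse master)
echo "local master:  $local_tip"
echo "origin master: $their_tip"
```

After the push, both names hold the same hash ID: the other Git now has I and J and has moved its master to J.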

It's possible that their Git has their branch name pointing to some commit you don't have, or that isn't in the chain formed by starting from your branch. Maybe their Git has:

...--H--K   <-- master

If so, your request, that they add I and J and make their master remember J, will be denied by default, because this would result in:

       K   [abandoned]
      /
...--H--I--J   <-- master

after which they will "lose" commit K, possibly for real and forever. Since the Git at origin belongs to you, though, you can normally use a force push (git push --force) to turn your polite request into a command: yes, set your master to J even though that loses K! (Usually this is a bad idea and you shouldn't do it. Instead, you should git fetch origin to bring K into your own repository, and then either merge or rebase to incorporate K along with your own I--J. This gives you a new and different commit, or set of commits, that you can push politely, that won't lose K. Instead, they will be pure additions of new commits.)
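The rejection and the polite fix can be reproduced locally. In this sketch a second clone named other pushes K first; the names, paths, files, and Example identity are all made up, and I use merge here although rebase would serve equally well:

```shell
set -e
# Throwaway identity so commits work even with no Git config (assumption).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)
cd "$work"
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/master   # use master whatever init.defaultBranch says
echo h > h.txt && git add h.txt && git commit -q -m "H"

cd "$work"
git clone -q --bare seed origin.git       # the shared repository
git clone -q origin.git other             # a second clone that lands K first
git clone -q origin.git mine

cd "$work/other"
echo k > k.txt && git add k.txt && git commit -q -m "K"
git push -q origin master
k_id=$(git rev-parse master)

cd "$work/mine"
echo i > i.txt && git add i.txt && git commit -q -m "I"
echo j > j.txt && git add j.txt && git commit -q -m "J"

# Setting their master to J would drop K, so their Git refuses.
if git push -q origin master 2>/dev/null; then polite=accepted; else polite=rejected; fi
echo "first push: $polite"

# Bring K in, merge it with I--J, and push again: now a pure addition.
git fetch -q origin
git merge -q --no-edit origin/master
git push -q origin master
their_tip=$(git --git-dir="$work/origin.git" rev-parse master)
echo "origin master is now $their_tip"
```

The second push succeeds without --force because the merge commit has K in its history, so nothing on the other side is lost.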

Note that these changes—usually pure additions of new commits followed by moving a branch name "forward"—go into your fork. They affect your origin, but they do not affect your upstream. That's not your repository after all! You cannot push directly to your upstream.

Finally, the pull request

Instead, what you can do, now that your new commits are in your origin which is a fork of your upstream, is to make a pull request, typically using some web interface clicky button. The provider's server will know—you will tell it, if and as necessary—which branch name you want to use in your origin, and which branch name you want to use in your upstream.

The provider will then notify whoever actually controls the upstream that you have made this pull request. Since the provider has your fork—your origin—specially shared with their repository that is your upstream, they will have direct access to the commits you pushed to your branch, that are now at your origin's branch tip.

Seeing the diff

Now we have all the tools we need to find the correct diff. We want to compare their branch tip commit, from the upstream branch name you picked out when you made the pull request, to the tip commit of your origin branch that you set when you ran git push. If you have those two hash IDs in front of you, you can run git diff <their-upstream-tip-hash> <your-origin-tip-hash>.

But hash IDs are terribly ugly. It would be nice if we could get Git to translate for us—and we can. I skipped over how git fetch works above, but let's dive into it for a moment.

Using `git fetch`

If you run git fetch upstream, that tells your Git to call up the Git that answers at the URL you stored under upstream. That's the Git for the upstream repository at your provider, the one you forked from. Your Git will call up that Git, obtain any new commits they have that you don't, and drop them into your repository. Then (and here's the key trick) your Git will set your remote-tracking branch names for upstream to record the hash IDs for each of their branch tips, per whatever they have right now.

Their master becomes your upstream/master. Their feature/tall becomes your upstream/feature/tall. Your Git remembers these for you, along with picking up any new commits they have.

The same holds when you run git fetch origin: your Git calls up the other Git at origin—this is your fork at the provider—and loads up any commits origin has that you don't. Then your Git sets your origin/master to remember the master at your origin, and so on. Note that when you git push to origin and give them updates, your Git knows if they take the updates. If they do accept your updates, your Git records the new hash IDs under origin/master, origin/develop, and so on.
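The following sketch shows a remote-tracking name being updated by a fetch. Local bare repositories stand in for the provider's upstream and your fork, and a clone named them plays the upstream maintainers; all of these names and paths are assumptions for the demo:

```shell
set -e
# Throwaway identity so commits work even with no Git config (assumption).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)
cd "$work"
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/master   # use master whatever init.defaultBranch says
echo one > file.txt && git add file.txt && git commit -q -m "first"

cd "$work"
git clone -q --bare seed upstream.git          # stands in for the original repository
git clone -q --bare upstream.git origin.git    # stands in for your fork
git clone -q origin.git mine
git -C mine remote add upstream "$work/upstream.git"

# Someone else adds a commit to upstream's master.
git clone -q upstream.git them
cd "$work/them"
echo two >> file.txt && git commit -q -am "their new work"
git push -q origin master

cd "$work/mine"
git fetch -q upstream                     # picks up their commit, sets upstream/master
tracked=$(git rev-parse upstream/master)
theirs=$(git --git-dir="$work/upstream.git" rev-parse master)
mine_origin=$(git rev-parse origin/master)
echo "upstream/master: $tracked"
echo "their master:    $theirs"
echo "origin/master:   $mine_origin"
```

Your own master and origin/master do not move; only the upstream/* names change, because only the upstream Git was consulted.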

Hence, as long as your Git is in sync with the two Gits at upstream and origin (and if it isn't, you can just run git fetch upstream and git fetch origin to update it), you now have in your own repository the correct commits, named via upstream/theirbranch and origin/yourbranch. So, instead of git diff <magic hash 1> <magic hash 2>, if you've sent a pull request asking your upstream to incorporate your feature/tall into their develop, you can git diff upstream/develop origin/feature/tall.
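Putting the pieces together, here is a sketch of the whole sequence with local bare repositories standing in for the provider's upstream and your fork (paths, file names, and the Example identity are assumptions; develop and feature/tall are the sample branch names from the text):

```shell
set -e
# Throwaway identity so commits work even with no Git config (assumption).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)
cd "$work"
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/develop  # the upstream branch in this demo
echo base > base.txt && git add base.txt && git commit -q -m "base"

cd "$work"
git clone -q --bare seed upstream.git          # stands in for the original repository
git clone -q --bare upstream.git origin.git    # stands in for your fork
git clone -q origin.git mine && cd mine
git remote add upstream "$work/upstream.git"

git checkout -q -b feature/tall
echo tall > tall.txt && git add tall.txt && git commit -q -m "add tall feature"
git push -q origin feature/tall           # your fork now has feature/tall

git fetch -q upstream                     # refresh upstream/develop
git fetch -q origin                       # refresh origin/feature/tall
files=$(git diff --name-only upstream/develop origin/feature/tall)
echo "files that differ: $files"
git diff upstream/develop origin/feature/tall
```

This compares the two tip snapshots directly, left side upstream/develop and right side origin/feature/tall, using names instead of raw hash IDs.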

Summary

The two commits you need to diff are those in two other repositories. If those two repositories are set up as remotes upstream and origin in your own repository, and your repository is up-to-date with respect to those two repositories, you can git diff or git log or git show the commits in question, and use your remote-tracking names upstream/* and origin/* to locate specific branch tips.

You can also have commits that aren't yet in either of these repositories, and preview what a pull request would show if you pushed them to your origin: just compare your upstream/* remote-tracking name tip commits to your own branch tip commits.