What is the standard practice when you're done with a git branch?


I forked a repo, then created a branch called patch1, made changes, and committed them. I then created a pull request upstream and got it merged into upstream's master. But in my local repo the branch isn't merged. So how do I pull the merge from upstream without getting the rest of the branches from upstream that I don't want/need, such as Hotfix12 or NewFeature5?

What is standard practice here?


Solution

  • (First, a side note: I submit that standard practice is not necessarily all that interesting, because your Git repository is yours. You can do whatever works for you! So standard practice is only interesting if it works for you. You might have to try several different approaches.)

    TL;DR

    I recommend running:

    git fetch upstream
    git checkout desired-branch
    git merge --ff-only upstream/theirbranch    # or upstream/branch, or upstream/master
    

    (you can do the fetch and checkout in either order). If you really like git pull, you can make git pull run the fetch and merge, but I dislike git pull and prefer to do this in two or more steps. (I have an alias for the merge --ff-only as well, git mff.)

    You are now free to delete any or all of the names that you made to get this PR done, in both your own laptop Git repository and in your GitHub fork. These names use almost no disk space, but they will use "head space" (mental energy) to keep track of them, so I recommend deleting them.
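As a rough sketch of that cleanup, the script below deletes a finished branch name both on the laptop and in the fork. Everything here is an assumption made for illustration: a local bare repository (`fork.git`) stands in for the GitHub fork, the identity is a throwaway, and the merge into master is simulated locally so that `git branch -d` sees the work as merged.

```shell
set -eu
# Throwaway identity so commits work without any global Git configuration
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d)
cd "$sandbox"

# Local stand-in for your GitHub fork (hypothetical; a real fork lives on GitHub)
git init -q --bare fork.git

# Stand-in for the laptop repository, with one finished PR branch
git init -q laptop
cd laptop
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "initial"
git remote add origin "$sandbox/fork.git"
git checkout -q -b patch1
git commit -q --allow-empty -m "patch1 work"
git push -q origin master patch1

# Pretend the PR was merged and master already contains the work
git checkout -q master
git merge -q --ff-only patch1

# Delete the name on the laptop, then in the fork
git branch -q -d patch1
git push -q origin --delete patch1
```

With lowercase `-d`, Git refuses to delete a branch whose commits are not yet merged, which makes it a reasonably safe default for this kind of housekeeping.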

    This --ff-only merge will fail in some cases; in those cases, well, see the long discussion below.

    Long

    Remember these things about Git:

    • Git is all about commits. Git is not about files, and not even about branches: it's about commits. Commits store data (snapshots of files) and metadata such as who made them and when. All commits are 100% read-only: no part of any commit can be changed. The true name of a commit is a big ugly hash ID, and that hash ID is exquisitely sensitive to every bit of data-and-metadata stored inside the commit, so that it is literally impossible to change the content of a commit: if you take one out, modify it slightly, and put it back, what you get is a new and different commit with a new and different hash ID.

    • Branch names like master and develop and so on are useful because they let you find commits. The true name of each commit is a big ugly hash ID that no human can remember. But we don't have to remember a big ugly hash ID, because we have a computer to remember it for us, under a name!

    • The word branch is ambiguous. Whenever someone talks about some Git branch, make sure you know whether they mean branch name, or something else. This is only indirectly related to your issue here, but is worth remembering at all times. See also What exactly do we mean by "branch"? In general, it's supposed to be obvious whether the word branch means branch name, or some collection of commits, ending at one specified by a branch name. Some people also use it to mean remote-tracking name (I will try not to do that here).

    • Your repository is yours. You have your own branch names. Your branch names are not any other Git's branch names. What your Git and their Git really share are the commits, by their hash IDs. (Since commits are 100% read-only, if your Git and their Git can literally share the commits, that's fine. If not, your Git and their Git can have separate copies. The copies can't change, as we already noted.)

    • Besides branch names, Git has more ways—that is, more names—by which it can remember any one particular commit's hash ID. One of these kinds of name is a remote-tracking name, like origin/master. A remote-tracking name is your Git's memory of some other Git's branch name (and the hash ID they have stored in that branch name).

    These last two items are the keys to dealing with your situation.
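To see the first two points of the list in action, here is a minimal sketch run in a throwaway repository (the `sandbox` directory and the demo identity are assumptions made for the example). "Editing" a commit with `--amend` produces a new hash ID and moves the branch name to it, while the original commit stays exactly as it was.

```shell
set -eu
# Throwaway identity so commits work without any global Git configuration
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master

echo "hello" > file.txt
git add file.txt
git commit -q -m "original message"
before=$(git rev-parse master)

# "Changing" the commit really makes a new commit; the name master moves to it
git commit -q --amend -m "edited message"
after=$(git rev-parse master)

echo "before: $before"
echo "after:  $after"

# The old commit is untouched and still present in the object database
git cat-file -t "$before"
```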

    I forked a repo ...

    This means that you used some hosting provider, such as GitHub, to make a second Git repository, based on a first Git repository. That is, on the GitHub side, you made a clone. Your clone-on-GitHub is now independent of their clone-on-GitHub.

    You probably also made a clone onto your own computer (your laptop or whatever), so there are probably three clones in existence at this point. That's fine! In Git-world, you get a clone, and they get a clone, and everyone gets a clone! There's no problem with having lots of clones ... well, except for one: every clone has its own branch names. That can be a lot of branch names to manage.

    There is something peculiar about the clone that GitHub in particular make, when you use the "fork this repository" clicky web button. In fact, there are several peculiar things, but the important one here is this: this clone copies all the branch names from the repository you're forking, to your GitHub clone. Your GitHub clone has only branch names and not remote-tracking names.

    If you subsequently ran:

    git clone <github-url>
    

    to copy your fork to a new clone on your laptop, this third clone did not copy all of "their" branch names. But hold on a moment: who is they here?

    • We already said that there are two interesting clones on GitHub. The meaning of they here depends on what URL you used. If you used the URL of the original repository, before you did a fork, the "they" is the original repository. If you used the URL of your fork, the "they" is your fork.

    • If you just now forked their repository, your fork has all the same branch names (and stored hash ID values) and all the same commits (with their unique hash IDs) as their fork. So in some sense, it does not matter which one you cloned. But over time, your fork and their fork may drift apart, as you and/or they add more commits to your and/or their repository. If you and they add different commits, or update your and their branch names in different ways, then it starts to matter.

    Typically, what you would do at this point is create two of what Git calls remotes in the clone on your laptop. A remote is just a short name like origin, where we'll have our (laptop) Git store the URL for some other Git repository. When you ran git clone <url>, your Git created this standard origin remote. Since there are two interesting repositories over on GitHub—your fork and their fork—you might well want to add a second remote, so that you have one remote for each fork. A standard name for this second remote is upstream. (It's not a particularly good name, because several other things in Git are called upstream at various times, but it's common enough, so we'll use it here.)
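A minimal sketch of that two-remote setup follows. The two bare repositories are local stand-ins for the GitHub repositories (an assumption made so the example runs offline); with real forks you would pass their GitHub URLs instead.

```shell
set -eu
sandbox=$(mktemp -d)
cd "$sandbox"

# Local stand-ins for the two GitHub repositories (hypothetical paths)
git init -q --bare theirs.git   # the original repository
git init -q --bare fork.git     # your fork

# Cloning creates the remote named origin automatically
git clone -q fork.git laptop 2>/dev/null   # hides the "empty repository" warning
cd laptop

# Add a second remote for the original repository
git remote add upstream "$sandbox/theirs.git"

# List each remote name with the URL stored under it
git remote -v
```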

    Remote-tracking names

    Let's get back to the fact that your laptop-side clone didn't copy either fork's branch names to your laptop-clone's branch names, and look also at why the GitHub "fork" button did copy all of their fork's branch names to your fork. This all has to do with remote-tracking names.1 Your laptop Git creates remote-tracking names for every branch name that your laptop Git sees in the remote Gits. These remote Gits have names on your laptop: origin and upstream. So your laptop Git can stick those names in front of their branch names, and turn the GitHub Gits' master—there are almost certainly two of these—into origin/master and upstream/master. It turns the GitHub Gits' develop into origin/develop and upstream/develop. This repeats for every branch name in each remote.

    The cost of saving all these extra names is very low: it takes essentially no disk space at all. That's because Git is all about commits, and commits have hash IDs. Suppose origin/master says commit a1234567..., and upstream/master says commit a1234567.... Your own Git already has commit a1234567..., so all your Git has to store is some name-value pairs: origin/master=a1234567..., upstream/master=a1234567....
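The name-value pairs can be inspected directly with `git for-each-ref`. The sketch below is an illustration under stated assumptions: `seed`, `theirs.git` and `fork.git` are local stand-ins for the GitHub side, created only so the example runs offline.

```shell
set -eu
# Throwaway identity so commits work without any global Git configuration
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d)
cd "$sandbox"

# One commit, copied into two bare repositories standing in for the two forks
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "initial"
git clone -q --bare seed theirs.git
git clone -q --bare seed fork.git

# The laptop clone, with both remotes
git clone -q fork.git laptop
cd laptop
git remote add upstream "$sandbox/theirs.git"
git fetch -q upstream

# Each remote-tracking name is just a name paired with a hash ID
git for-each-ref --format='%(refname:short)=%(objectname)' 'refs/remotes/*/master'
```

Both lines of output should carry the same hash ID, because both names point at the one commit the laptop already stores.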

    The nice things about remote-tracking names, then, are these:

    1. They take essentially no space at all. (Git generally stores them in .git/packed-refs, which is a single file with records, rather than in multiple files, so they tend to take even less than a disk block. Your own branch names are already cheap storage-space-wise, as most of those are stored in a single disk block, but these are even cheaper.)

    2. They automatically update. When you run git fetch origin, your Git calls up the Git at origin (your fork over on GitHub). Your Git gets from their Git any new commits and other objects required, then updates all of your (laptop) origin/* remote-tracking names to match all of your (GitHub-fork) branch names. When you run git fetch upstream, your Git calls up the Git at upstream (their fork over on GitHub). Your Git gets from their Git any new commits and other objects required, then updates all of your upstream/* remote-tracking names to match all of their branch names.

    You might want to add --prune to your git fetch commands, or set fetch.prune to true in your Git configuration, so that your Git removes from your remote-tracking names any branch names that "their" Git (your or their fork on GitHub) no longer has. Without --prune, the update in step 2 above never notices that they, whoever they might be, deleted feature/tall, so your origin/feature/tall or upstream/feature/tall—whichever it is—hangs around as a stale remote-tracking name. With --prune or fetch.prune, your laptop Git notices that this name should go away, and removes it.
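Here is a small sketch of the difference, using the feature/tall name from the paragraph above. The bare repository `fork.git` is a local stand-in for a GitHub fork, and the script assumes `fetch.prune` is not already enabled in your own Git configuration.

```shell
set -eu
# Throwaway identity so commits work without any global Git configuration
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d)
cd "$sandbox"

# A repository with two branch names, copied to a stand-in fork
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "initial"
git -C seed branch feature/tall
git clone -q --bare seed fork.git

# The laptop clone starts out with origin/master and origin/feature/tall
git clone -q fork.git laptop
cd laptop

# "They" delete the branch name on their side
git --git-dir="$sandbox/fork.git" branch -q -D feature/tall

# A plain fetch leaves the stale remote-tracking name in place
git fetch -q origin
stale=$(git for-each-ref refs/remotes/origin/feature/tall)

# A pruning fetch removes it
git fetch -q --prune origin
gone=$(git for-each-ref refs/remotes/origin/feature/tall)

echo "after plain fetch:   ${stale:-<nothing>}"
echo "after pruning fetch: ${gone:-<nothing>}"
```

Running `git config --global fetch.prune true` once should make every later fetch behave like the second one.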

    So: why didn't the GitHub "fork a repository" button create remote-tracking names instead of branch names? Well, only GitHub can really answer that; but if they had, you'd need some way to manipulate remote-tracking names on GitHub. Since they didn't, they only have to provide a way for you to manipulate branch names on GitHub. Note that GitHub do not have a clicky button for fetch: you cannot make your GitHub fork run git fetch! Since it's git fetch that you use on your laptop to update your remote-tracking names, the lack of fetch on GitHub means you don't have a way to update remote-tracking names there.


    1Historically, remote-tracking names actually came after the various decisions that led to all of this, but I think it makes more sense to follow the logic the other way.


    Transferring commits: git fetch and git push

    There are two common ways to get commits into a Git repository. We've already mentioned one of them above, namely git fetch. You run git fetch remote, and your Git fishes out the stored URL from the remote-name—e.g., the URL for origin—and calls up a Git at that location.

    That Git lists for your Git all of its branch names (and tag names, and other internal names, but here we're only really looking at branch names). Each branch name identifies one commit, which is the tip of the branch. All earlier commits that are reachable from that tip are accessible using that branch name. For a thorough discussion of the concept of reachability, see Think Like (a) Git. Understanding reachability is a key to using Git, so you should definitely work through this stuff if the concepts are unfamiliar.

    At this point, your Git can ask their Git for any commits and other internal Git objects that your Git wants or needs, but does not have. This step is actually pretty interesting and gets into a lot of graph theory, but we can just take for granted that the two Gits do this pretty well. They figure out a reasonably minimal set of Git objects that their Git has, that your Git wants. They compress these objects—that's what all the counting objects and compressing objects messages here are about—and send them over. Your Git puts these into your collection, adding the commits and other internal objects to your repository on your laptop. That allows your Git to update your remote-tracking names: you now have all the commits they have, plus any commits you have that you haven't given to them.

    Note that your remote-tracking names are, in effect, pre-reserved for their Git. You don't call any of your own branches origin/master or origin/develop or the like.2 So Git can freely smash and replace any or all of your remote-tracking names: none of your branch names are affected.

    If you want to go the other way, the opposite of fetch is push.3 But there's an asymmetry here. When you run git push origin branch, you have your Git call up some other Git, again by looking up the remote's URL. But this time, instead of having them list out their branch names and such and bringing commits to your Git, your Git sends them commits and other internal Git objects. You send to them any commits they need to make the tip commit of your own branch branch useful—this includes any reachable commits that you have, that they don't—and once again we get all the counting and compressing objects messages. But now, having sent any required commits to their Git, your Git asks—usually politely—that they should set their branch name branch to the hash ID of the commit that's also the tip of your branch branch.

    They don't set a remote-tracking name! (GitHub in particular do not even have remote-tracking names.) They don't set some other reserved-space name. They set their branch name.

    When your Git makes a polite request, they'll refuse the request if they don't like it. If you're creating a new branch name, they will usually like that. If you're updating an existing one, though, they won't like the update if the new hash ID refers to a commit from which the previous hash ID of that same name is not reachable.

    That is, consider some chain of commits:

    ...--G--H   <-- branch
    

    Now we'll add some new commits to the end:

    ...--G--H   <-- branch
             \
              I--J
    

    and propose that they move their name branch from H to J. If they do, commit H remains reachable: starting at J and working backwards, we go from J to I and then to H. So this request will be accepted. But if we do this instead:

    ...--G--H   <-- branch
          \
           K--L
    

    and ask them to set their name branch to point to L, they'll refuse, because H cannot be reached from L. The reachable commits from L are L, then K, then G, then other commits before G.

    Git's term for this is that the change to the name branch must be a fast-forward. Moving branch from H to J is a fast-forward; moving branch from H to L is a non-fast-forward.4
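The fast-forward test is an ancestry check, and `git merge-base --is-ancestor` can run it by hand. The sketch below rebuilds the two drawings in a throwaway repository; the branch names `with-IJ` and `with-KL` and the helper `check` are made up for the example.

```shell
set -eu
# Throwaway identity so commits work without any global Git configuration
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/branch
git commit -q --allow-empty -m "G"
git commit -q --allow-empty -m "H"

# ...--G--H--I--J
git checkout -q -b with-IJ branch
git commit -q --allow-empty -m "I"
git commit -q --allow-empty -m "J"

# ...--G--K--L
git checkout -q -b with-KL branch~1
git commit -q --allow-empty -m "K"
git commit -q --allow-empty -m "L"

# Moving a name from OLD to NEW is a fast-forward when OLD is reachable from NEW
check() {
    if git merge-base --is-ancestor "$1" "$2"; then
        echo "fast-forward"
    else
        echo "non-fast-forward"
    fi
}

h_to_j=$(check branch with-IJ)
h_to_l=$(check branch with-KL)
echo "H -> J: $h_to_j"
echo "H -> L: $h_to_l"
```

The first line reports fast-forward and the second non-fast-forward, matching the two cases described above.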


    2Technically, you can. Git has internal namespaces so that Git can keep it all straight. It's not a good idea though: you probably don't have these internal namespaces, and you will mess it up. 😀

    3It's not push and pull, it's push and fetch! This is something of a historical accident and I think it leads to a lot of confusion, but it is what it is.

    4To force-push a non-fast-forward update, you can use the --force flag, or add the + flag to a refspec, which is a thing we have not defined here. Both of these change the polite request into a command. They could still refuse, but we won't worry about these details here.


    Pull requests

    Pull requests (PRs) themselves are a host-provider-specific feature. Git does not have pull requests! (Git has a git request-pull command, but what it does is generate an email message.)

    Note that if we own a GitHub fork, we can git push to it. That's all fine: we can update our fork. Our git push operations will succeed if they're fast-forwards, and in special cases, we can git push --force to make our operations succeed even when they're not fast-forwards. So we can git push all we want, to our GitHub fork, which we call origin. That lets us change the shape of our GitHub fork all we like. Our fork will store commits, like any Git repository. It will store them under branch names, like any Git repository. It does not have remote-tracking names (those are specific to our laptop Git repository), but that's fine: we don't need our fork to have remote-tracking names.

    But we might want to get our commits into the GitHub fork that's not ours, at the URL we are storing under the name upstream. How will we do that?

    If—this is a big if that's generally not true—the owners of that other GitHub fork were to give us write permission to their repository, we could just git push our commits directly to upstream. But they would have to really trust us with their repository.

    GitHub could offer some sort of special name-space: semi-protected branch name patterns that the owners of the upstream fork could give to us to write on, that they won't use themselves. GitHub could have an enforcement mechanism to make all of this work. But they don't. Instead, GitHub give us pull requests.

    Before we make a PR, we start by git push-ing our laptop-made commits to our own GitHub fork at origin. These commits go in by their hash IDs, updating branch names in our GitHub fork according to our git push commands. Eventually we have, in our GitHub fork, some branch name that points to some tip commit that we like, that we want to offer to the people who operate the GitHub fork we call upstream.

    It's at this point that we make the pull request. We use GitHub's interface to send commits from our GitHub fork to their GitHub fork (as if by git push), but they show up in the fork at upstream under special names that the GitHub folks control.5 We have no agency in this process beyond clicking the "make a PR" button: GitHub decide on the special name, and create the name for the PR. GitHub then also send email, Slack messages, etc. (whatever might be appropriate) to alert the people who run the fork we call upstream that there is a new pull request for them.

    Everything is now up to them.


    5These names live in the refs/pull/* namespace. The things in this namespace are numbered: each PR or issue gets a unique counting number in a GitHub repository. When we make a new PR, GitHub give it a number (let's say 123 for concreteness) and create names of the form refs/pull/123/head and maybe, or maybe not, refs/pull/123/merge. The merge name is created if and only if the GitHub side software decides that our PR can be merged; in this case, the merge ref points to a merge commit that the GitHub Git already made. The head ref refers specifically to the commit at the tip of the branch that we chose when we clicked on the "make pull request" button.

    If we push new commits to our PR, the head ref gets updated, and the merge ref gets destroyed and a new one created if possible, using the same rules as always.


    After a PR

    At this point, whoever controls the GitHub fork we call upstream has various options. They have a pull request, which has a number. They can inspect the pull request. They can bring it into a Git repository on their laptop, using git fetch and the special names that GitHub created for the PR (see footnote 5). Or they can just use the various clicky buttons on the web interface.
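Bringing a PR into a laptop repository with git fetch might look like the sketch below. It is a simulation under stated assumptions: a local bare repository plays the part of the GitHub repository, the refs/pull/123/head name is created by hand to mimic what GitHub does, and `pr-123` is a made-up local branch name.

```shell
set -eu
# Throwaway identity so commits work without any global Git configuration
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d)
cd "$sandbox"

git init -q contributor
git -C contributor symbolic-ref HEAD refs/heads/master
git -C contributor commit -q --allow-empty -m "base"
git clone -q --bare contributor upstream.git

# Mimic GitHub: store the PR's tip commit under refs/pull/123/head
git -C contributor commit -q --allow-empty -m "PR work"
pr_tip=$(git -C contributor rev-parse HEAD)
git -C contributor push -q "$sandbox/upstream.git" HEAD:refs/pull/123/head

# An ordinary clone copies branch names only, so the PR commit is not there yet
git clone -q upstream.git maintainer
cd maintainer

# Fetch the PR's head ref into a local branch and inspect it
git fetch -q origin refs/pull/123/head:pr-123
git --no-pager log --oneline master..pr-123
```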

    If they do use those clicky buttons, GitHub in particular offer three buttons, which GitHub label this way:

    • Merge. This does a straight git merge, the same way Git would do it.6 All of your commits, with their hash IDs, are now reachable from whatever branch they merged into. One new merge commit exists on their branch; this new merge commit does not exist anywhere in your Git repository yet.7

    • Squash and merge. This in effect runs git merge --squash, though not exactly as Git would, because in command line Git, git merge --squash does not actually commit anything. In this case, they make one new commit on their branch that merges your work, but they do not take any of your commits.

    • Rebase and merge. This in effect runs git rebase --no-ff, copying all of your commits to new ones with new and different hash IDs.

    This, finally, brings us to your question:

    how do I pull the merge from upstream without getting the rest of the branches from upstream that I don't want/need

    The answer to this depends on what you want in each of your two repositories: your GitHub fork and your laptop repository.

    If they did a real merge, you can do:

    git fetch upstream
    git checkout desired-branch
    git merge --ff-only upstream/theirbranch
    

    because your commits from your branch named branch are now in their branch. All you need to do is add that final merge commit. You no longer need any extra names to remember branch tips you were using to create and send the pull request, so feel free to delete those.
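The real-merge case can be rehearsed offline with the sketch below. It rests on several assumptions: a local bare repository stands in for upstream, a second clone named `maintainer` plays the people who press the Merge button, and pushing patch1 straight to upstream is only a stand-in for the pull request.

```shell
set -eu
# Throwaway identity so commits work without any global Git configuration
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare upstream.git
git --git-dir=upstream.git symbolic-ref HEAD refs/heads/master

git init -q laptop
cd laptop
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "base"
git remote add upstream "$sandbox/upstream.git"
git push -q upstream master

git checkout -q -b patch1
git commit -q --allow-empty -m "patch1 change"
patch1_tip=$(git rev-parse patch1)
git push -q upstream patch1     # stands in for the pull request

# Maintainer side: a true merge commit, as the "Merge" button would make
git clone -q "$sandbox/upstream.git" "$sandbox/maintainer"
git -C "$sandbox/maintainer" merge -q --no-ff -m "Merge patch1" origin/patch1
git -C "$sandbox/maintainer" push -q origin master

# Back on the laptop: pick up the merge commit by fast-forwarding
git fetch -q upstream
git checkout -q master
git merge -q --ff-only upstream/master
git --no-pager log --oneline --graph master
```

After the fast-forward, the laptop's master holds the original patch1 commit and the new merge commit, so the patch1 name itself is no longer needed.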

    If they did a squash and merge or rebase and merge, this --ff-only will fail whenever desired-branch still contains your original commits, because those commits are not reachable from the new commit(s) on their branch. It's now up to you: do you want to abandon your original commits, in favor of whatever commit(s) they put into their upstream/theirbranch? Whether you do or not, you now have all of Git's tools available to you. Their commit(s) are on upstream/theirbranch: you can view them with git log. Your commits are reachable via your branch names. You can use git branch -f or git reset --hard to discard some or all of your commits. You can rename your branches to keep your old commits while you make sure that theirs work. You can do whatever you like! Your repository is yours, after all.
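The squash case can be rehearsed the same way. In this sketch (again with a local bare repository standing in for upstream and a hypothetical `maintainer` clone doing the squash), the fast-forward is attempted from patch1, which still holds the original commit. It is refused, and one of the possible responses is shown: adopting their commit with git reset --hard.

```shell
set -eu
# Throwaway identity so commits work without any global Git configuration
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare upstream.git
git --git-dir=upstream.git symbolic-ref HEAD refs/heads/master

git init -q laptop
cd laptop
git symbolic-ref HEAD refs/heads/master
echo "base" > file.txt
git add file.txt
git commit -q -m "base"
git remote add upstream "$sandbox/upstream.git"
git push -q upstream master

git checkout -q -b patch1
echo "change" >> file.txt
git commit -q -am "patch1 change"
git push -q upstream patch1     # stands in for the pull request

# Maintainer side: one new commit with the same change, but none of your commits
git clone -q "$sandbox/upstream.git" "$sandbox/maintainer"
git -C "$sandbox/maintainer" merge -q --squash origin/patch1
git -C "$sandbox/maintainer" commit -q -m "patch1 (squashed)"
git -C "$sandbox/maintainer" push -q origin master

# Back on the laptop, still on patch1
git fetch -q upstream
if git merge -q --ff-only upstream/master 2>/dev/null; then
    outcome="fast-forwarded"
else
    outcome="refused"
fi
echo "merge --ff-only: $outcome"

# One option: abandon the original commit and adopt theirs
git reset -q --hard upstream/master
```

Note that `git reset --hard` discards the original commit from the branch, so it only makes sense once you have decided that their version is the one to keep.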


    6In fact, since GitHub already did it—it's in refs/pull/number/merge—they don't actually have to do anything here. If the merge had conflicts, this ref does not exist and the "merge" option is disabled.

    7If GitHub use the pre-made merge then, technically, you could grab that merge and have it in your repository already. I do not know for sure whether GitHub use the existing merge (they could, but don't have to), though it would be possible to figure this out by experimentation. Note, though, that GitHub could choose to change how this works at any time, so it's probably unwise to count on it one way or another.