
Git pull vs fetch - no difference for newly fetched branches?


I've been reading about git pull and fetch commands and the difference between them.

I agree there is a difference between the 2 commands when we have master branches both locally and remotely and therefore pull will integrate whatever changes we fetched.

But what if new branches that have never been fetched before have been pushed to the remote? If we only use git fetch instead of git pull, what is the difference internally, from Git's point of view, after we have fetched or pulled those branches? Are the new branches not integrated if we only run git fetch?

I wanted to test it and did the following:

I have a remote repository which I cloned twice, let's call those local repos repo 1 and repo 2 - repo 1 will create new branches and push them to remote and repo 2 will pull/fetch them from remote.

I created and pushed a new branch - side_branch_1 - to the remote repo from repo 1. Then I got back to repo 2 and used git pull. Then I ran git branch -a and saw the new branch as remotes/origin/side_branch_1. I also opened the .git/FETCH_HEAD file and saw the line for that branch: <sha-1> not-for-merge branch side_branch_1 of <url>.

After that, in repo 1 I created and pushed side_branch_2, and in repo 2 I used git fetch this time. Then I ran git branch -a again and saw the new branch as remotes/origin/side_branch_2. I also opened the .git/FETCH_HEAD file again and saw the line for that branch: <sha-1> not-for-merge branch side_branch_2 of <url>.

Is there no difference for new branches whether I pull or fetch? And if yes then what is the difference from Git internal point of view?

Because side_branch_1 is tagged as not-for-merge even though it has been pulled. Why? What am I missing?


Solution

  • TL;DR

    git pull means run git fetch, then run a second Git command. The first step—git fetch—does not affect any of your branches. It does not change anything you're working on, if you're working on anything.

    The second step, which defaults to running git merge, affects your current branch. It does not create a new branch, so in general, any new branch names created in the other Git are not relevant unless you explicitly named them on your git pull command.

    Assuming you run git pull with no extra arguments, the remote on which git pull runs git fetch is the remote associated with the current branch, and the commit that is used for rebase-or-merge is that associated with the upstream of the current branch as updated by the git fetch step. Git imposes limitations on the upstream setting for a branch name in your repository: in particular, if your Git is not yet aware that some name exists in the other Git, your Git won't let you set it as the upstream. So "new" branches—which we haven't properly defined, really—are not relevant.

    If you add more arguments to your git pull command line, the picture gets more complicated.

    Long

    Is there no difference for new branches whether I pull or fetch?

    Git pull always means: run git fetch, then run a second Git command. So obviously these are different because git fetch does not run a second Git command. It is irrelevant here whether or not the fetch step sees branch names that your Git has not seen before.

    And if yes then what is the difference from Git internal point of view?

    Here's where you need to be closely aware of how Git really works. To keep this answer short(ish), I'll refer you to my other answers for the full detail, but the essentials are:

    • Each commit has a unique hash ID, which is the long random-looking commit-name that git log shows you: commit 1c56d6f57adebf2a0ac910ca62a940dc7820bb68 for instance.
    • Each commit stores a snapshot of all of your files. The files inside each commit are in a special, read-only, Git-only, compressed format, frozen for all time.

    • Each commit also stores some metadata: information about the commit that isn't a file saved with the commit, but rather holds things like who made the commit, when, and why (their log message). In this metadata, most commits store the hash ID of their immediate parent commit. Merge commits store two or more parent hash IDs, and at least one commit, the very first one in the repository, has no parent at all.

    • A branch name like master simply holds the raw hash ID of the last commit in the chain. Hence if you have a branch named master and some commits, master holds some hash ID H, and commit H points back to some earlier commit G, which points back to a yet-earlier commit F, and so on:

      ... <-F <-G <-H   <--master
      

      To add a commit to a branch, we select that branch name, which selects that commit. That takes the frozen, Git-only files out of the commit into an area where we can work on them. We work on them as desired and eventually tell Git: make a new commit. Git makes the new commit point back to the one we got out, saving a new snapshot of all of our files, and then, having made the new commit, changes the branch name so that it points to the new commit:

      ...--F--G--H--I   <-- master
      
    • Branch names are not the only kind of names that can remember commit hash IDs. More than one name can identify any single commit, too.
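
The bullet points above can be modelled in a few lines of Python. This is only a toy sketch, not how Git stores anything: the `commits` and `branches` dictionaries and the `make_commit` and `history` helpers are made-up names, and the hash is computed over a made-up payload rather than over Git's real object format.

```python
import hashlib

commits = {}    # hash ID -> {"parents": ..., "snapshot": ..., "message": ...}
branches = {}   # branch name -> hash ID of the last commit in the chain


def make_commit(branch, snapshot, message):
    """Make a commit that points back to the branch tip, then move the branch."""
    parent = branches.get(branch)
    parents = (parent,) if parent else ()   # the very first commit has no parent
    payload = repr((parents, sorted(snapshot.items()), message)).encode()
    hash_id = hashlib.sha1(payload).hexdigest()   # toy hash, not Git's format
    commits[hash_id] = {
        "parents": parents,
        "snapshot": dict(snapshot),   # full snapshot of all files
        "message": message,
    }
    branches[branch] = hash_id        # the name now selects the new commit
    return hash_id


def history(branch):
    """Start from the branch name and work backwards through first parents."""
    messages = []
    current = branches.get(branch)
    while current is not None:
        messages.append(commits[current]["message"])
        parents = commits[current]["parents"]
        current = parents[0] if parents else None
    return messages


g = make_commit("master", {"README": "v1"}, "G")
h = make_commit("master", {"README": "v2"}, "H")
i = make_commit("master", {"README": "v3"}, "I")
print(history("master"))   # ['I', 'H', 'G']
```

Note that nothing in `commits` is ever modified after it is written. Only the entry in `branches` moves.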

    The git clone command works by calling up another Git repository. You tell your system:

    1. Make a new, empty directory / folder (or use an empty folder that you point git clone to).
    2. Make a new, empty repository there: git init.
    3. Store a URL for later, under the name origin (or whatever other name you tell Git to use): git remote add.
    4. Do any other configuration you told Git to do using the git clone command.
    5. Call up another Git at origin—at the stored URL—and have it list out its branch (and other) names and their raw hash IDs. Then, ask that Git for the commits ... in this case, all of them. Copy all of its commits over into our otherwise-empty repository. Take its branch names and rename them: make its master become our origin/master, for instance, and make its develop become our origin/develop, and so on.
    6. Last, for one of these names (probably master), use the renamed origin/ version of the name to make a branch name, and point that branch name at the same commit as our origin/ version of the name.
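
The renaming in steps 5 and 6 can be sketched like this. It is a simplified model that tracks names only and ignores the copying of commits. `rename_for_clone` and `clone_names` are hypothetical helpers, and the single letters stand in for real hash IDs.

```python
def rename_for_clone(their_branches, remote="origin"):
    """Step 5: turn their branch names into our remote-tracking names."""
    return {f"{remote}/{name}": hash_id
            for name, hash_id in their_branches.items()}


def clone_names(their_branches, checkout="master", remote="origin"):
    """Steps 5 and 6: return (our branch names, our remote-tracking names)."""
    remote_tracking = rename_for_clone(their_branches, remote)
    # Step 6: create exactly one branch of our own, pointing to the same
    # commit as the renamed origin/ version of that name.
    ours = {checkout: remote_tracking[f"{remote}/{checkout}"]}
    return ours, remote_tracking


ours, remote_tracking = clone_names({"master": "H", "develop": "J"})
print(ours)              # {'master': 'H'}
print(remote_tracking)   # {'origin/master': 'H', 'origin/develop': 'J'}
```

Their develop does not become a branch of ours. It only becomes origin/develop.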

    So after the initial git clone, you have remote-tracking names, usually of the form origin/*, for each of the other Git's branch names. You then have one branch name of your own, usually master, pointing to the same commit as your origin/master. If they have master and develop, perhaps you now have:

    ...--G--H   <-- master, origin/master
          \
           I--J   <-- origin/develop
    

    Step 5, in the six-step git clone sequence above, is in fact git fetch. However, rather than obtain every commit, what git fetch does is talk with the other Git, to see which commits they have that you don't. During the initial clone, you don't have any commits, so that's just automatically all of theirs. Later, it's their new ones.

    When you run git fetch later, if they still have their master identifying commit H and their develop identifying commit J, your Git will look in your repository, using the real hash IDs that H and J stand in for, and see that you already have them. Your Git does not need to get any new commits. If they've added another commit to their develop, though, they will have new commit K and you'll get it:

    ...--G--H   <-- master, origin/master
          \
           I--J   <-- origin/develop
               \
                K
    

    and then your git fetch will update your remote-tracking name origin/develop to point to commit K:

    ...--G--H   <-- master, origin/master
          \
           I--J--K   <-- origin/develop
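
Here is a rough sketch of that fetch step, assuming a toy repository made of plain dictionaries. The `fetch` function and the dictionary layout are invented for illustration. The sketch also assumes that having a commit means having all of its ancestors, which is the normal situation outside of shallow clones.

```python
def fetch(ours, theirs, remote="origin"):
    """Copy the commits we lack, then update remote-tracking names.

    Both repositories are dicts. "commits" maps a hash ID to a tuple of
    parent hash IDs. Our own "branches" entry is never touched.
    """
    new_commits = []
    for tip in theirs["branches"].values():
        stack = [tip]
        while stack:
            hash_id = stack.pop()
            if hash_id in ours["commits"]:
                continue   # we have it, so (by assumption) its ancestors too
            ours["commits"][hash_id] = theirs["commits"][hash_id]
            new_commits.append(hash_id)
            stack.extend(theirs["commits"][hash_id])
    for name, tip in theirs["branches"].items():
        ours["remote_tracking"][f"{remote}/{name}"] = tip
    return new_commits


theirs = {
    "commits": {"G": (), "H": ("G",), "I": ("G",), "J": ("I",), "K": ("J",)},
    "branches": {"master": "H", "develop": "K"},
}
ours = {
    "commits": {"G": (), "H": ("G",), "I": ("G",), "J": ("I",)},
    "branches": {"master": "H"},
    "remote_tracking": {"origin/master": "H", "origin/develop": "J"},
}
print(fetch(ours, theirs))                        # ['K']
print(ours["remote_tracking"]["origin/develop"])  # K
print(ours["branches"])                           # {'master': 'H'}
```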
    

    If they do something unusual and force their develop back one step and you run git fetch again, you will keep commit K for a while—typically at least 30 days by default—but your Git will adjust your origin/develop to match their develop:

    ...--G--H   <-- master, origin/master
          \
           I--J   <-- origin/develop
               \
                K   [no name: hard to find!]
    

    Git in general finds commits by starting from some name—whether it's your branch name, or your remote-tracking name, or any other name—and then working backwards.

    (There are hidden logs of previously-stored hash IDs for each name, by which you can find K. The entries in these logs eventually expire, and that's where the 30-day limit comes from: after 30 days, the entry retaining K expires. Some time after that, Git's garbage collector, git gc, will throw K out for real, if nobody has made a new name to protect it.)
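
The idea of finding commits by starting from names and working backwards can be sketched as a reachability walk. This is a toy model with a hypothetical `reachable` helper. It ignores the hidden logs mentioned above, which in real Git also keep K alive for a while.

```python
def reachable(commits, names):
    """Return every commit found by walking backwards from any name."""
    seen = set()
    stack = list(names.values())
    while stack:
        hash_id = stack.pop()
        if hash_id in seen:
            continue
        seen.add(hash_id)
        stack.extend(commits[hash_id])   # follow all parents
    return seen


# hash ID -> parent hash IDs, matching the drawing above
commits = {"G": (), "H": ("G",), "I": ("G",), "J": ("I",), "K": ("J",)}
names = {"master": "H", "origin/master": "H", "origin/develop": "J"}

unreachable = set(commits) - reachable(commits, names)
print(sorted(unreachable))   # ['K']

# Making a new name that points to K protects it.
names["keep-k"] = "K"
print(sorted(set(commits) - reachable(commits, names)))   # []
```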

    Running git fetch like this, with no name at all—defaulting to origin, usually—or with just the name of the remote such as origin, will—as long as you haven't set things up specially—obtain all of the branch names from the other Git, and create or update all of your remote-tracking names accordingly. However, setting up something called a single-branch clone configures your Git differently, so that git fetch only updates a single remote-tracking name. You can reconfigure this later, or override the set of names to update using a refspec, but we won't go into further detail here.

    So far, this is all about git fetch; let's start using a branch name

    Again, Git's fetch is the part that obtains new commits from the other Git. Having obtained new commits, if there were some to obtain, git fetch adjusts your remote-tracking names. It has no effect on any of your branch names. Your branch names are all undisturbed.

    If you never have any branch names of your own—which would be weird, though it is possible to do this—and never do any work on your own, which is less weird and sensible for certain applications (archival storage, for instance), that would suffice. But you probably do use branches.

    Let's say you make your own branch name, dave or whatever you like. Let's say you make this name point to existing commit H:

    ...--G--H   <-- dave, master, origin/master
          \
           I--J--K   <-- origin/develop
    

    Now that you have more than one branch name, we'd like to have Git remember which one you're actually using. We'll attach the special name HEAD to one of them:

    ...--G--H   <-- dave (HEAD), master, origin/master
          \
           I--J--K   <-- origin/develop
    

    So now we can tell that you're using the name dave and commit H. Three names, dave and master and origin/master, all identify commit H right now.

    We mentioned above that the files saved in commits are in a special, read-only, Git-only, compressed and frozen format that only Git can use. So Git has copied these files out, into both Git's index and a work area for you. The work area is your working tree or work-tree. It has ordinary files stored in your computer's ordinary format.

    You make new commits—usually anyway—by manipulating these ordinary files, then using git add to copy them back into Git's index. This re-compresses the file into the frozen format, ready to go into a new commit. When you run git commit, Git will package up the files that are in its index at that time. Hence we can say that the main function of the index is to store what you propose to put into your next commit. (It has other functions as well but we won't get into them here.)

    Eventually you have your files in shape, and git add-ed, and you run git commit. Git collects the appropriate metadata and writes out a new commit, which assigns the new commit its unique hash ID. Git then stores the new commit's hash ID into the current branch name, giving us:

              L   <-- dave (HEAD)
             /
    ...--G--H   <-- master, origin/master
          \
           I--J--K   <-- origin/develop
    

    You could equally well work on master, or develop that starts out pointing to commit K, or whatever, but one way or another, you make a new commit, and it points back to whatever commit you told Git to use to start with.

    Now, if you run git fetch and they, whoever they are, made or otherwise acquired new commits you have not yet seen, these new commits have been added on to their branches. Your Git sees them in their repository, sees that you do not have them yet, and gets them. Let's draw one (and stop drawing I-J-K as they're in the way, but the letters are used up so I'll go with M here next):

              L   <-- dave (HEAD)
             /
    ...--G--H   <-- master
             \
              M   <-- origin/master
    

    You might like to incorporate their new commit somehow.

    Exactly how you incorporate their new commit is up to you. You could, for instance:

    • git checkout master and then git merge origin/master
    • git merge origin/master right now while on commit L on branch dave

    or do any number of other things.

    If you:

    git checkout master; git merge origin/master
    

    though, your Git will do what Git calls a fast-forward merge. This is not a merge at all—it's somewhat poorly named—but it has this effect:

              L   <-- dave
             /
    ...--G--H--M   <-- master (HEAD), origin/master
    

    In fact, if you run git checkout master; git rebase origin/master, the same thing happens in this particular case. In other cases, different things may happen.
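
Whether a fast-forward is possible comes down to an ancestry test, which can be sketched as follows. This is a simplified model, not Git's actual merge code. `is_ancestor` and `merge` are hypothetical helpers, and the real-merge case is reduced to a message.

```python
def is_ancestor(commits, ancestor, descendant):
    """True if walking back from descendant reaches ancestor."""
    stack, seen = [descendant], set()
    while stack:
        hash_id = stack.pop()
        if hash_id == ancestor:
            return True
        if hash_id not in seen:
            seen.add(hash_id)
            stack.extend(commits[hash_id])
    return False


def merge(commits, branches, current, other_tip):
    """Decide what merging other_tip into the current branch would do."""
    tip = branches[current]
    if is_ancestor(commits, other_tip, tip):
        return "already up to date"
    if is_ancestor(commits, tip, other_tip):
        branches[current] = other_tip    # just slide the name forward
        return "fast-forward"
    return "real merge needed"           # histories have diverged


# hash ID -> parent hash IDs, matching the drawing above
commits = {"G": (), "H": ("G",), "L": ("H",), "M": ("H",)}
branches = {"master": "H", "dave": "L"}
origin_master = "M"

print(merge(commits, branches, "master", origin_master))   # fast-forward
print(branches["master"])                                   # M
print(merge(commits, branches, "dave", origin_master))      # real merge needed
```

On master, commit H is an ancestor of M, so the name simply moves. On dave, neither L nor M is an ancestor of the other, so this would take a real merge.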

    This is where git pull comes in

    As a rule, once you've brought new commits over from some other Git with git fetch, you tend to want to do something with them. If you're on your master and they have updated their master, the thing you might want to do is update your master. The two most common ways to do that are to run either git merge or git rebase.

    The git pull command can be told to run either of those as its second command. The default is for it to run git merge. Both git merge and git rebase operate on the current branch. That is, they look at the special name HEAD. As long as that is attached to some branch name—as it normally is—that is the branch name of yours that they will affect. They make changes to Git's index and to your work-tree; both may change which commit is selected by the current branch name; git merge may make a new merge commit, or perform a fast-forward operation, or sometimes, do nothing.
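
To tie this back to the question about new branches, here is a toy sketch of the two steps. Everything in it is an assumption made for illustration: the `pull` function, the dictionary layout, and above all the second step, which pretends that the merge is always a fast-forward.

```python
def pull(repo, their_branches, remote="origin"):
    """Toy model of pull: a fetch step, then a step on the current branch."""
    # Step 1, the fetch: create or update remote-tracking names only.
    for name, tip in their_branches.items():
        repo["remote_tracking"][f"{remote}/{name}"] = tip

    # Step 2, the second command: affects only the branch HEAD is attached to.
    current = repo["head"]
    upstream = repo["upstream"].get(current)
    if upstream is None:
        return f"fetched, but {current} has no upstream to merge"
    old = repo["branches"][current]
    # Assumption: the merge is a fast-forward, so the name just moves.
    repo["branches"][current] = repo["remote_tracking"][upstream]
    return f"{current}: {old} -> {repo['branches'][current]}"


repo = {
    "head": "master",
    "branches": {"master": "H"},
    "remote_tracking": {"origin/master": "H"},
    "upstream": {"master": "origin/master"},
}
result = pull(repo, {"master": "M", "side_branch_1": "S"})
print(result)                             # master: H -> M
print(sorted(repo["branches"]))           # ['master']
print(sorted(repo["remote_tracking"]))    # ['origin/master', 'origin/side_branch_1']
```

The new name side_branch_1 shows up only as origin/side_branch_1, exactly as it would after a plain fetch. The second step never looks at it.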

    One of the parts I don't like about git pull is that you do not always know, when you hit Enter, exactly what commits git fetch will end up fetching, and where it may move any remote-tracking names. But you're dead set on running git merge or git rebase using those new commits and updated names. (This is technically off a bit, as we'll see—it doesn't use updated origin/* names directly—but it's close enough here.)

    Even if the new commits aren't something you want to use to affect your current branch, you're going to have this happen. You can't tell if it will happen. You could use some viewer to inspect the other Git repository first, but what happens if you view it, and then just before you press Enter, someone else changes things in that other repository? Still, people like this a lot, and use it all the time, so let's get to your detailed questions.

    I also opened the .git/FETCH_HEAD file again and saw the line for that branch: <sha-1> not-for-merge branch side_branch_2 of <url>.

    Here's the historical secret (or not so secret) about git fetch and git pull: they are so old that git pull itself existed before remote-tracking names like origin/master did. Remotes and remote-tracking names were invented some time between Git version 1.4 and 1.5, and there was some fumbling around with different ideas. The git pull command kept working the way people wanted it to, all throughout these transitional times as the newfangled remotes and remote-tracking names were being developed.

    To avoid having to change too much code too often, and/or because remotes and remote-tracking names didn't exist yet, git fetch has always written everything into .git/FETCH_HEAD. To let the early git pull scripts figure out which commit hash ID to give to git merge, git fetch notes which one of our branch names we're using now—that's the "where is HEAD attached" check—and what name(s) to use from the other Git. It then marks each .git/FETCH_HEAD line with not-for-merge, or doesn't mark it, depending on the arguments you gave to git fetch.

    When you run git pull, you can give a bunch of arguments to the git pull command:

    git pull                 # no arguments at all
    git pull origin          # just a remote
    git pull origin master   # a remote and a branch name *on the remote*
    

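    Back when git pull literally ran git fetch, it passed these arguments on to git fetch. It now has git fetch built into it, but it still works the same. If you give one or more branch names here, those are the ones that git fetch leaves unmarked (that is, without not-for-merge) in the .git/FETCH_HEAD file. With no branch names given, the only unmarked line is the one for the upstream of your current branch, which is why side_branch_1 shows up as not-for-merge even after a git pull.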
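
A minimal sketch of how a script could pick the for-merge entries out of that file. It assumes each line has three tab-separated fields: the hash ID, a marker that is either not-for-merge or empty, and a description. That layout is my reading of the file rather than something spelled out above, and `for_merge_entries`, the short hashes and the sample text are all made up.

```python
def for_merge_entries(fetch_head_text):
    """Return (hash ID, description) for lines not marked not-for-merge."""
    entries = []
    for line in fetch_head_text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: <hash> TAB <marker or empty> TAB <description>
        hash_id, marker, description = line.split("\t", 2)
        if marker != "not-for-merge":
            entries.append((hash_id, description))
    return entries


# Made-up sample in the assumed layout, with short fake hash IDs.
sample = (
    "aaaa111\t\tbranch 'master' of <url>\n"
    "bbbb222\tnot-for-merge\tbranch 'side_branch_1' of <url>\n"
    "cccc333\tnot-for-merge\tbranch 'side_branch_2' of <url>\n"
)
print(for_merge_entries(sample))
# [('aaaa111', "branch 'master' of <url>")]
```

Only the unmarked line is a candidate for the second command. The side branches are recorded as fetched, nothing more.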

    Similarly, when git pull was still a shell script—it was rewritten in C relatively recently—this is how git pull decided which hash ID to pass to git merge or, if you choose git rebase as your second command, to git rebase. What it does now is more obscure. Since the fetch part is now built in as C-coded function calls, it can just retain the raw hash IDs in memory.

    In Git version 1.8.4, the Git folks decided that git fetch origin master should update origin/master. Before that, git fetch origin would update all remote-tracking names, but git fetch origin master would update none. From Git 1.8.4 onward, git fetch origin master updates origin/master. It still does not update other remote-tracking origin/* names, because it does not bring over commits corresponding to any updated names. (It could still update the remote-tracking names in some cases, but it just doesn't.)

    Conclusion

    The git fetch that git pull runs:

    • mostly gets the arguments you give: e.g., git pull xyzzy one two three runs git fetch xyzzy one two three. "Mostly" is only here because some options affect which second command to use, and/or are eaten by git pull itself, and/or are passed to the second command instead of being passed to git fetch.
    • fetches from the named remote (or from a given URL, but this changes many things), and thereby updates some set of remote-tracking names;
    • records everything it did in .git/FETCH_HEAD in case you are still using the old git pull shell scripts.

    In general, git fetch is safe to run at any time. (You can configure it to be unsafe, if you really wish, by setting remote.name.fetch inappropriately or passing an unsafe refspec argument. It's worth noting, though, that git fetch has built-in safety checks even if you do this. The old pull script turns them off!)

    The subsequent git merge or git rebase operates on the current branch and it tends to not be a good idea to let these happen if you have uncommitted work. Git will normally detect such a case, and prevent the second command from running at all for these cases. In the distant past, though, the pull command could (and did) wreck in-progress work irrecoverably, because git pull—the old script, anyway—turned off a lot of safety-checks.

    In any case, the second command—the merge-or-rebase step—gets a bunch of extra arguments that made it work the same during the Git 1.4 to 1.6 transitional period when remotes and remote-tracking names were changing. That was almost 15 years ago now, but it still works the same way. If you use:

    git fetch
    git merge
    

    and your Git makes a merge commit, the default merge message will be something like:

    merge branch origin/dave into dave
    

    but if you use:

    git pull
    

    the default merge message will be more like:

    merge branch dave of <url> into dave
    

    The "something like" is because the exact spelling of each message here depends on the branch names (obviously), and whether you're merging into master—this omits the into <branch> part—and there are some quote marks that get inserted that I didn't want to bother with here. :-)