
How to properly clone and switch to existing branch of fork


I had been working on a branch of a fork associated with a merge request and I was stupid enough to accidentally delete my local git folder. Luckily, all code changes had already been pushed, but I can't figure out how to properly recreate the state of the folder at the time of deletion.

I had originally cloned a project using

git clone [origin_URL]

after which I made some local changes, created a fork on GitLab, added it using

git remote add fork [fork_URL]

then created a branch, pushed to my fork

git checkout -b new_feature
git add [files]
git commit -m [message]
git push fork new_feature

and created a merge request from my_user_name/project:new_feature into other_user_name/project:master. The merge request had seen some discussion and some more commits had been made and I was ready to implement the finishing touches before getting on with other work but that was before I realized I had accidentally deleted my local folder.

“Not a big deal,” I thought, “it's all on GitLab anyway, I'll just have to get back to where I was” but now I've spent the last 2 hours trying to figure out the proper way of cloning the repository and configuring the branches the right way, all to no avail. I have been reading through a couple of other Git questions on SO found on Google and the SO search, none of which seemed to provide an answer to this particular problem so I think this shouldn't be a duplicate but of course I can't be too sure with the sheer amount of Git questions on here.

I've tried a couple of variants of git clone, cloning the original repository first and using git remote add fork, cloning the fork, renaming it as such and adding the original repository as origin, cloning with --branch… I've tried getting back onto the branch I was working on before using git checkout but all attempts so far ended up with a “detached head” state I don't know how to get out of or Git forcing me to create a new branch when all I want to do is get back onto the branch I was working on. I tried using git switch fork/my_feature but got

fatal: a branch is expected, got remote branch 'fork/my_feature'

Since the request had been opened, there had been some (unrelated) activity on the master branch, so the source branch is some commits behind the target branch which means I'll need to rebase the source branch onto the target branch – I don't know enough about Git to know if this is relevant to the problem, so I thought I'd mention it. Any insight from someone experienced enough with Git to tell me why the previous approaches failed would be greatly appreciated.


Solution

  • TL;DR

    You're on the right track. You just need to use git switch -c my_feature --track fork/my_feature. The reason why is at least a little bit messy, and there may be several other ways to deal with this, depending on your personal tastes, but the above should Just Work.

    If you just want to view that commit, you can tell git switch that it is OK to use the detached-HEAD mode:

    git switch --detach fork/my_feature
    

    In general it's not a good idea to do new work like this though, as it's too easy to lose track of it.
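
A minimal sketch of the whole recovery, assuming remotes named origin and fork as in the question. The two bare repositories in a temporary directory are hypothetical stand-ins for the GitLab URLs, and the demo identity is made up.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)

# Stand-ins for the upstream project and the fork (hypothetical local paths).
git init -q --bare "$tmp/origin.git"
git -C "$tmp/origin.git" symbolic-ref HEAD refs/heads/master
git init -q "$tmp/seed"
git -C "$tmp/seed" symbolic-ref HEAD refs/heads/master
git -C "$tmp/seed" commit -q --allow-empty -m "initial"
git -C "$tmp/seed" push -q "$tmp/origin.git" master
git clone -q --bare "$tmp/origin.git" "$tmp/fork.git"
git -C "$tmp/seed" checkout -q -b my_feature
git -C "$tmp/seed" commit -q --allow-empty -m "feature work"
git -C "$tmp/seed" push -q "$tmp/fork.git" my_feature

# The recovery itself: clone, add the fork, fetch it, create a tracking branch.
git clone -q "$tmp/origin.git" "$tmp/project"
cd "$tmp/project"
git remote add fork "$tmp/fork.git"
git fetch -q fork
git switch -q -c my_feature --track fork/my_feature
git status -sb   # first line shows my_feature...fork/my_feature
```

With real URLs only the last block is needed. On Git older than 2.23, git checkout -b my_feature --track fork/my_feature should do the same job.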

    Long

    “Not a big deal,” I thought, “it's all on GitLab anyway, I'll just have to get back to where I was”

    This is right—that's the point of distributed development, after all: there's more than one copy of the repository. The tricky part is getting back to where you were, though. It's not hard, but it is tricky, with the difference being that people can use Git without understanding what's really going on. That's because we (humans, rather than computers) like to think of Git as being about files and branches, and it's not. Git is all about commits.

    The problem is, commits are identified, in Git, by hash IDs: big ugly random-looking strings of letters and digits, such as d2ecc46c0981fb829fdfb204604ed0a2798cbe07. Every commit gets one of these, and no commit ever shares it with any other commit.1 What this means is that these hash IDs are the commits, in a very real sense. You just present the hash ID to your Git, and it fishes out the commit, if it has it; or if it does not have it, you know that you need to copy that commit from whichever Git or Gits do have it.

    So even though these are what Git uses, these hash IDs are no good for humans and are not how we interact with Git. They are also quite useless for doing any new work: their only purpose is to locate and extract existing work. Remember that every commit represents a snapshot, frozen in time. That is, each commit stores all of your files. Every commit has a complete copy of every file. These copies are de-duplicated, which is a good thing. They are then stored in a frozen, compressed form that only Git itself can read and use. Overall, this is good because it means that each repository is (normally) relatively small: the repository does not grow immensely fat as we make more commits, because each commit really just keeps re-using the previous files.
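
The de-duplication can be observed directly. This is a small sketch in a throwaway repository; the file names keep.txt and edit.txt are made up for the illustration.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

echo "unchanged content" > keep.txt
echo "v1" > edit.txt
git add keep.txt edit.txt
git commit -q -m "first"
echo "v2" > edit.txt
git commit -q -am "second"

# Each snapshot lists both files, but keep.txt maps to the same object ID
# in both commits, so its content is stored only once.
git ls-tree HEAD~1
git ls-tree HEAD
keep_old=$(git rev-parse HEAD~1:keep.txt)
keep_new=$(git rev-parse HEAD:keep.txt)
edit_old=$(git rev-parse HEAD~1:edit.txt)
edit_new=$(git rev-parse HEAD:edit.txt)
```

Only edit.txt gets a new object in the second commit; the unchanged file is simply referenced again.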

    But this does mean that we literally cannot work on a commit. We have to have Git extract that commit. That copies the committed files to a form we can use. This is what git switch, and to a lesser extent, git restore, are about. Note that in Git versions predating 2.23, these are combined into one big git checkout command.


    1This is generally even true across entirely-independent repositories. The commit hash ID quoted above is in the Git repository for Git itself, so that if you have a clone of that repository and it is reasonably up to date, you will have that commit in your clone of that repository. That hash ID won't be in your GitLab repository, though. So these hash IDs are universally unique.

    They don't have to be (they just have to be unique within the set of repositories that you'll connect to each other via git remote add and the like) but in general, they are. The chance of any two individual hash IDs colliding accidentally is just one in 2^160. However, the birthday paradox means that the chance of some pair colliding rises quickly as the number of objects increases: it approaches even odds at around 2^80 (roughly 10^24) objects, which is far more than any real repository holds. It's also possible for a malicious actor to craft a collision on purpose, though Git is accidentally immune to some known collisions. In any case, SHA-1 is no longer considered cryptographically secure, so Git may eventually move to a 256-bit SHA.


    Names

    While Git is all about commits, we humans like to think in terms of branches. There is a big problem with this word branch: we humans use it ambiguously. Sometimes, when we say branch B, we mean one commit, found by our name B. Sometimes, when we say branch B, we mean every commit up to and including the last commit on a series of commits whose last commit is found using our name B. Sometimes we mean the name B itself, and sometimes we mean a specific subset of the general names that Git provides.

    (See also What exactly do we mean by "branch"?)

    Git has multiple different kinds of names. These include branch names like master or feature/tall, tag names like v1.0 and v2.17.2; and remote-tracking names like origin/master or fork/my_feature. All of these names are treated pretty similarly, and wind up being handled as a general form that Git calls refs or references.

    To keep track of each reference and what kind of name it is, Git's refs are stored in name spaces. The refs/heads/ namespace holds all of our branch names, so master is actually just a short way to spell refs/heads/master. Tag names are in refs/tags: v1.0 is just short for refs/tags/v1.0.

    Remote-tracking names are the most complicated of these, but follow exactly the same pattern: origin/master is refs/remotes/origin/master and fork/my_feature is refs/remotes/fork/my_feature. The slightly tricky part is that refs/remotes/ itself is split into refs/remotes/origin/* and refs/remotes/fork/*. This comes about because of the remote names origin and fork. We'll get back to this later.

    Each of these names just stores one hash ID. That's all it has to do, so that is all that Git does with it:2 the name master means some commit hash ID, and the name fork/my_feature also means some commit hash ID (probably a different one, but two different names can mean the same hash ID). Branch names, however, have one very special feature.

    When we use git switch to get "on a branch" like master or develop, that branch name becomes the current branch. Git extracts the right commit for us: Git looks up the hash ID, in Git's name-to-hash-ID table, and copies out the commit's frozen files, into a work area with ordinary files. This allows us to view and edit the files. At the same time, though, Git stores the name into the special Git ref HEAD, so that git status, for instance, will now say on branch master or on branch develop.
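
To see the full spelling of a name, the hash ID it stores, and what HEAD is attached to, something like the following can be run. It is a sketch in a scratch repository; the tag v1.0 is created only for the demonstration.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master   # pin the branch name for this demo
git commit -q --allow-empty -m "initial"
git tag v1.0

# Short names expand to full refs in their namespaces.
git rev-parse --symbolic-full-name master   # refs/heads/master
git rev-parse --symbolic-full-name v1.0     # refs/tags/v1.0

# Every ref just stores one hash ID.
git for-each-ref --format='%(refname) %(objectname)'

# HEAD stores the name of the current branch.
head_target=$(git symbolic-ref HEAD)
echo "$head_target"                         # refs/heads/master
```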

    Being "on a branch" gives us a happy property. Describing this property correctly requires that we take a second look at the anatomy of a commit.


    2Branch names have other functions as well, but they're handled by storing the branch name and its other data in your .git/config file, not by the name-to-hash-ID mapping part that is part of every Git reference.


    Commits store data and metadata, and contain hash IDs

    We said above that each commit stores a full snapshot of all files, and that's still true. That's the main bulk of the data of a commit: a saved file-tree, in which all of the files are in the special, read-only, Git-only, frozen and compressed format. We also said that each commit has a unique hash ID, and that is also true. What we left out is that each commit contains some metadata: some information about the commit itself.

    We see some or most of this metadata when we run git log, e.g.:

    $ git log --format=fuller -1 | sed 's/@/ /'
    commit d2ecc46c0981fb829fdfb204604ed0a2798cbe07
    Author:     Junio C Hamano <gitster pobox.com>
    AuthorDate: Sun May 24 18:13:53 2020 -0700
    Commit:     Junio C Hamano <gitster pobox.com>
    CommitDate: Sun May 24 19:39:40 2020 -0700
    
        Hopefully final batch before 2.27-rc2
    
        Signed-off-by: Junio C Hamano <gitster pobox.com>
    

    (I used --format=fuller here to show more than we usually see). In fact, this is just a cleaned-up version of the raw data that's inside the commit, which we can view directly:

    $ git cat-file -p HEAD | sed 's/@/ /'
    tree e83aacc68752967a710fc32e3cf49356959545eb
    parent ea7aa4f612ef33ecfb7fd6d488d949da3a51a377
    author Junio C Hamano <gitster pobox.com> 1590369233 -0700
    committer Junio C Hamano <gitster pobox.com> 1590374380 -0700
    
    Hopefully final batch before 2.27-rc2
    
    Signed-off-by: Junio C Hamano <gitster pobox.com>
    

    The tree line represents the saved snapshot. The author and committer lines give the name of the person who made the commit: the author is whoever wrote it, and the committer is the person who added it to the Git repository.3 The parent line gives the raw hash ID of the commit that comes before this commit.

    Most commits have exactly one of these parent lines; the very first commit in a repository has none, and some commits have more than one. In fact, if we look at the commit that comes before the HEAD one:

    $ git cat-file -p ea7aa4f612ef33ecfb7fd6d488d949da3a51a377 | sed 's/@/ /'
    tree 0342252fde5f2b5721299d321d57ce12542b2957
    parent d55a4ae71d515e788e5afb355a20c4b262049cac
    parent 1eb73712360744b552f30a6961c03d05bc44bef2
    author Junio C Hamano <gitster pobox.com> 1590374380 -0700
    committer Junio C Hamano <gitster pobox.com> 1590374380 -0700
    
    Merge branch 'dd/t5703-grep-a-fix'
    
    Update an unconditional use of "grep -a" with a perl script in a test.
    
    * dd/t5703-grep-a-fix:
      t5703: replace "grep -a" usage by perl
    

    we see that it has two parent lines. That marks this commit as a merge commit, combining two different series of commits—two lines of work.
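
The number of parents can also be counted mechanically. The sketch below builds a tiny history with one merge; the helper count_parents and the branch name side are hypothetical names used only here.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "root"
git checkout -q -b side
git commit -q --allow-empty -m "side work"
git checkout -q master
git commit -q --allow-empty -m "main work"
git merge -q -m "Merge branch 'side'" side

count_parents() {
    # rev-list --parents prints the commit's hash followed by its parents' hashes
    set -- $(git rev-list --parents -n 1 "$1")
    echo $(($# - 1))
}

root_parents=$(count_parents master~2)    # the root commit
plain_parents=$(count_parents master~1)   # an ordinary commit
merge_parents=$(count_parents master)     # the merge commit
echo "$root_parents $plain_parents $merge_parents"
```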


    3This specific split allows for emailed patches, which were much more important in the 2005ish time frame, when Linus Torvalds first wrote Git: see commit e83c5163316f89bfbde7d9ab23ca2e25604af290.


    These interconnections form backwards-looking chains

    When a name like master contains a hash ID like d2ecc46c..., or a commit like d2ecc46c... or ea7aa4f6... contains the hash ID of some earlier commit, we say that that name, or that commit, points to the target. So the name master points to d2ecc46c..., which in turn points to ea7aa4f6.... We can draw this:

    ... <-ea7aa4f6 <-d2ecc46c   <--master
    

    In fact ea7aa4f6 points back to two different commits:

    ...--d55a4ae7--ea7aa4f6--d2ecc46c   <-- master
                  /
     ...--1eb73712
    

    In general, if we let round dots o or uppercase letters stand in for the random-looking and utterly forgettable hash IDs, we get more useful pictures:

    ...--o--o--o   <-- master
    

    or:

      ...--D--G--H   <-- master
             /
    ...--E--F
    

    which is a more digestible, and hence easier-for-humans, way to think about branches and branch names. The key takeaways here are that the branch name points to the last commit in the branch, and that commit points back to the earlier commits that are also contained in the branch. So given the drawing above, all commits up through H are on master. At some point there was a name, dd/t5703-grep-a-fix, pointing to commit F:

      ...--D--G--H   <-- master
             /
    ...--E--F   <-- dd/t5703-grep-a-fix
    

    That name is not needed any more because Git finds commits by starting with some name—such as master—and finding the last commit, then using that to work backwards. From commit H, Git works back to G; from G, Git works back to both D and F; so Git can find F without a separate name for it.

    For these purposes, any name is as good as any other. A branch name like master, or a tag name like v2.17.2, or a remote-tracking name like dd/t5703-grep-a-fix (I'm guessing that this was a remote-tracking name), all just serve to locate one specific commit, and that's all we need for these purposes.
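
A short sketch of the point about names being unnecessary after a merge: once the side branch is merged, its name can be deleted and its last commit is still found by walking back from master. The branch name side is a hypothetical stand-in for dd/t5703-grep-a-fix.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "D"
git checkout -q -b side
git commit -q --allow-empty -m "F"
git checkout -q master
git commit -q --allow-empty -m "G-prep"
git merge -q -m "merge side" side

side_tip=$(git rev-parse side)
git branch -q -d side          # remove the name; the commit stays

# F is still reachable from master through the merge's second parent.
if git merge-base --is-ancestor "$side_tip" master; then
    echo "still reachable from master"
fi
git log --oneline --graph master
```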

    The special feature of a branch name

    What makes a branch name special is that we can get "on" the branch, using git switch or, in Git predating 2.23, git checkout. We cannot get "on" a tag or remote-tracking name: instead, we get a detached HEAD (git checkout) or a complaint (git switch):4

    fatal: a branch is expected, got remote branch 'fork/my_feature'
    

    But we can get on a branch:

    git checkout master
    

    after which we can draw our graph like this:

    ...--G--H   <-- master (HEAD)
    

    If we now do some work and make a new commit, here's what happens:

    1. do some work: we modify files in our work-tree, then run git add to copy the updated files back into Git's index.
    2. git commit: Git creates a new commit, which gets a new and unique hash ID. We'll call this commit I, using the next letter after H.

      • Git collects the appropriate metadata: user name, email, log message, current date-and-time, and so forth. Included in that metadata is the raw hash ID of commit H, which is the current commit because master is the current branch name because HEAD is attached to master. So the parent for new commit I will be H.
      • Git freezes, for all time, all the files that are in its index (aka the staging area). We won't go into details here, but note that the index started out matching commit H.
      • Git writes out the new commit, which acquires its hash ID at this time. The hash ID is based on both the data and all the metadata, including the exact second at which you ran git commit. It looks random, but it's simply a checksum of all of this data.
      • Last, Git does the special trick. Before I describe it, let's look at the graph.

    Since new commit I's parent is existing commit H, commit I points back to H:

    ...--G--H
             \
              I
    

    But what about the name master, which used to contain H's hash ID? Well, because HEAD is attached to master, Git writes I's hash ID into the name master. So now master points to I, not H:

    ...--G--H--I   <-- master (HEAD)
    

    No existing commit has changed at all. It's physically impossible to change any part of any existing commit, because the actual hash ID for H is the checksum of all the bytes in commit H. Those are not allowed to change! If we take H out of the repository, fiddle with some bytes, and put the result back, that's just a different commit H' with a different hash ID. Commit H will still be there. Commit I will point to commit H, because now that commit I exists, no part of it can be changed either.
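
The checksum claim can be checked by hand. In this sketch, re-hashing the exact bytes of a commit reproduces its hash ID, and altering the commit (here via --amend, which changes the message) produces a different commit while the original stays in the repository. The file name f.txt is made up.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo "data" > f.txt
git add f.txt
git commit -q -m "H"
h=$(git rev-parse HEAD)

# Re-hashing the commit's exact bytes reproduces its hash ID.
rehash=$(git cat-file commit "$h" | git hash-object -t commit --stdin)

# "Changing" the commit really makes a new one; H itself is untouched.
git commit -q --amend -m "H prime"
h_prime=$(git rev-parse HEAD)
git cat-file -e "$h" && echo "original H still exists"
```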

    So, the special feature of being "on a branch" is that as we make new commits, Git automatically updates the branch name. The name that gets updated is the name we have HEAD attached-to. We can make extra names:

    ...--G--H   <-- master, develop
    

    We pick one of these names and attach HEAD to it:

    ...--G--H   <-- master (HEAD), develop
    

    Then we make a new commit and get:

              I   <-- master (HEAD)
             /
    ...--G--H   <-- develop
    

    If we make another commit, that continues to extend the branch:

              I--J   <-- master (HEAD)
             /
    ...--G--H   <-- develop
    

    No existing commit changes, and no other branch name moves. Only the branch name we're "on"—as in git status says on branch master—moves.
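
This behaviour is easy to confirm in a scratch repository. The sketch below creates two names for commit H, commits once while on master, and compares the hash IDs before and after.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "H"
git branch develop                    # second name for the same commit H
h=$(git rev-parse master)

git commit -q --allow-empty -m "I"    # HEAD is attached to master
i=$(git rev-parse master)

echo "master:  $i"
echo "develop: $(git rev-parse develop)"
```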

    If we now switch to develop, we get commit H back out in our work area (and in Git's index / staging-area):

              I--J   <-- master
             /
    ...--G--H   <-- develop (HEAD)
    

    and now, if we make two new commits, they make the name develop move accordingly:

              I--J   <-- master
             /
    ...--G--H
             \
              K--L   <-- develop (HEAD)
    

    and now we have our familiar branch-y structures. Note that commits up through and including H are on both branches. Commits I-J are currently only on master, and K-L are currently only on develop. The name HEAD is attached to the name develop, telling us that the current branch name is develop and the current commit is commit L.
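
The same picture can be rebuilt and queried. This sketch recreates the I-J and K-L history and asks Git which commits are on only one branch and where the two lines of work split.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "H"
h=$(git rev-parse HEAD)
git branch develop
git commit -q --allow-empty -m "I"
git commit -q --allow-empty -m "J"
git switch -q develop
git commit -q --allow-empty -m "K"
git commit -q --allow-empty -m "L"

git log --oneline --graph --all
only_master=$(git rev-list --count develop..master)    # I and J
only_develop=$(git rev-list --count master..develop)   # K and L
split_point=$(git merge-base master develop)           # commit H
```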


    4You can see here that Git called this a remote branch rather than my preferred term, remote-tracking name. Given how badly overloaded the word branch is in Git, I think remote-tracking name is a better phrase to describe the name fork/my_feature. Both mean the same thing here though.


    remotes, git fetch, and remote-tracking names

    When we have two or more repositories that are supposed to hold the same commits, we need to have them talk to each other. In general, to achieve this, we give each "other Git" a name. This name is a remote.

    Most of the time, we get our first and only remote automatically. We make our own local Git repository, not by running git init, but by running git clone:

    git clone ssh://git@github.com/project/repo.git
    

    for instance. The git clone command is actually just a fancy wrapper that runs six commands for us:

    1. mkdir, to create a new empty directory, plus an internal chdir into the new directory for each of the subsequent commands;
    2. git init, to create a repository within this empty directory;
    3. git remote add origin url, to create the remote name origin and use that to store the url;
    4. git config if / as needed (mostly if we specify particular configuration items with our git clone command);
    5. git fetch origin; and finally
    6. git switch -c master --track origin/master, or something very similar (see below).

    When this all finishes, we are left with a non-bare repository—a repository that has an associated work-tree, where we can do our work—that has the master branch checked out into its work-tree,5 which is at the top level of the new directory. The repository proper is the .git sub-directory and all of its files.6
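
The six steps can be approximated by hand. This is only a sketch: the upstream repository is a hypothetical local one, step 4 is empty here, and the real git clone picks the branch from the other repository's HEAD rather than hard-coding master.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)

# Hypothetical upstream to "clone" from.
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" commit -q --allow-empty -m "initial"

mkdir "$tmp/manual" && cd "$tmp/manual"          # step 1
git init -q                                      # step 2
git remote add origin "$tmp/upstream"            # step 3
                                                 # step 4: no extra config here
git fetch -q origin                              # step 5
git switch -q -c master --track origin/master    # step 6
git status -sb
```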

    The fact that we have a remote named origin is where our origin/* remote-tracking names come from. Step 5 above is for our Git to run git fetch origin. This has our Git call up their Git, using the URL saved in step 3. Their Git then lists out, for our Git, all of their branch and other names, and the corresponding commit hash IDs. Our Git mostly throws away the non-branch names, except for tags, which are handled in a complicated way that we won't cover properly here. Our Git takes the branch names and renames them. Their master becomes our origin/master, for instance.7

    The full name of each of these renamed branch names is a remote-tracking name, i.e., is in that remote-tracking namespace we mentioned earlier: their refs/heads/master (a branch name) becomes our refs/remotes/origin/master (a remote-tracking name). For every one of their refs/heads/* names, we get a refs/remotes/origin/* name.

    The hash IDs their branch names hold become the hash IDs that our remote-tracking names hold. For our remote-tracking names to hold these hash IDs, though, we must first obtain the commits. So before we actually use these renamed names, our Git tells their Git: Please send over those commits, and also all their ancestors.

    The result is that we get every commit that they have, except perhaps for some unreachable commits, or commits that are only reachable from a non-branch name.8 So after our:

    git clone <url>
    

    we have a new repository that has every commit they have—or almost every commit—and has changed their branch names into our remote-tracking names. The last step, step 6 above, has created in our Git repository, our one and only branch, named master.

    We can at any later time run:

    git fetch origin
    

    and our Git will call up their Git and have them list out all their names, just as before. Just as before, we'll find any commits from these names that we want, but do not yet have, and get those commits from their Git. Then we'll adjust our remote-tracking names, origin/*, to match their branch names. The end result of this is that git fetch origin obtains origin's new commits and updates our memory of their branch names. It does not touch any of our branch names at all.
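
A sketch of that last point, using a hypothetical local upstream: after the other repository gains a commit, git fetch origin moves origin/master but leaves our own master where it was.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" commit -q --allow-empty -m "initial"
git clone -q "$tmp/upstream" "$tmp/clone"

# Someone adds a commit to the other repository.
git -C "$tmp/upstream" commit -q --allow-empty -m "new upstream work"

cd "$tmp/clone"
before=$(git rev-parse master)
git fetch -q origin
echo "master:        $(git rev-parse master)"
echo "origin/master: $(git rev-parse origin/master)"
```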

    We can use:

    git remote add fork <url>
    

    to add a new remote named fork that stores the given URL. Having done that, we then run:

    git fetch fork
    

    to have our Git call up their Git, have them list out their branch (and other) names, and have our Git rename those names to our remote-tracking fork/* names. We will get from them any commits that they have, that we don't, that we need to update our fork/* names, and then update our fork/* names to remember their branch names.
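
As a sketch with hypothetical local repositories in place of the real URLs: after adding and fetching the second remote, the fork's branches show up as fork/* remote-tracking names, and no local branch is created.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" commit -q --allow-empty -m "initial"
git clone -q --bare "$tmp/upstream" "$tmp/fork.git"
git -C "$tmp/fork.git" branch my_feature master   # a branch only the fork has

git clone -q "$tmp/upstream" "$tmp/project"
cd "$tmp/project"
git remote add fork "$tmp/fork.git"
git fetch -q fork

git branch -r                                     # remote-tracking names
local_branches=$(git for-each-ref --format='%(refname:short)' refs/heads)
fork_refs=$(git for-each-ref --format='%(refname)' refs/remotes/fork)
```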


    5We still have to do our own cd into the directory because of the way current-working-directory works in Linux. In theory, on some OSes, git clone could adjust the shell's working directory, but given that people don't expect it to do that, it doesn't do that—it just runs each of its other five commands in the new directory.

    6You can, if you really want, later move the repository proper somewhere else. If you do your own commands instead of having git clone do them for you, you can set this up initially. Nobody really works this way, as far as I know, and the feature that allows it is not particularly useful for ordinary work—it's meant for internal use, to handle submodules in the modern Git way, as opposed to the old Git 1.7 style.

    7You can change all of this. For instance, using git clone --mirror makes a bare clone—one with no work-tree, which means you can't do any work in it—in which our Git slavishly copies all of their names. The underlying mechanism here is very flexible, but in practice, it's mostly used to handle three particularly interesting special cases. The only one we're covering here is the normal everyday full-clone-with-work-tree case.

    8An unreachable commit is one that cannot be found by starting with a name and working backwards through the parent links. A commit that is reachable by some funky GitHub-specific name, such as refs/pull/123/head, might also not come over. We can, by configuring the fancy mechanisms mentioned in footnote 7, arrange to bring over pull-request commits too, though.


    Their branch names are not our branch names

    While this is already covered above, it's worth emphasizing it again. Now that we have two remotes, origin and fork, we have two sets of remote-tracking names. We have an origin/master and a fork/master, provided that both origin and fork have branches named master.

    Their masters may have different final commits (and different earlier commits too) than our master. Of course, if we just now ran git clone on origin's URL, it is likely that our master matches our origin/master: that both point to commit H, for instance. It's fork/master that is more likely to be different from these two.

    In any case, though, if we want to have other branch names in our Git repository, we can now create them. Our branch names are ours. We can do whatever we like with them, creating and deleting them whenever we want. The only constraint on our branch names is that each branch name must point to some actual, existing, valid commit hash ID in our repository. At the moment, we have the commits we got from the other two repositories, so we can make any branch name we like, pointing to any of these commits.

    "Do What I Mean" mode

    The git switch command takes a branch name:

    git switch master
    

    Given that the name master already exists, this means select the name master and the commit to which the name master points. Git will try to attach HEAD to that name, and extract that commit's frozen snapshot into our work-tree.

    But you can give git switch the name of a branch that does not exist yet!

    Suppose that origin/develop exists, and suppose further that fork/develop does not exist, or that we have not done git remote add fork .... That is, we have something like this:

    ...--G--H   <-- master (HEAD), origin/master
          \
           I   <-- origin/develop
    

    Then:

    git switch develop
    

    would fail, because we don't have a develop—but before it fails, Git first checks: Do I have exactly one remote-tracking name that indicates that the other Git has a develop? In this case, because there is an origin/develop and no fork/develop, that's the case: there is exactly one other develop.

    Our Git then says: Aha, you mean you want me to create develop using the commit identified by origin/develop. So our Git does that—it creates the name develop, pointing to commit I, and switches to it:

    ...--G--H   <-- master, origin/master
          \
           I   <-- develop (HEAD), origin/develop
    

    This "do what I mean" mode is optional, and also has some fancier features, but it defaults to "on" and behaves as described above. It basically turns the git switch you gave—git switch develop—into:

    git switch -c develop origin/develop
    

    which is an explicit request: check out the commit identified by the name origin/develop (in this case commit I) and at the same time, create a new branch name develop pointing to this commit.
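
A sketch of the guessing behaviour with a single remote, assuming default configuration; the upstream repository is a hypothetical local one that has a develop branch.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" commit -q --allow-empty -m "H"
git -C "$tmp/upstream" checkout -q -b develop
git -C "$tmp/upstream" commit -q --allow-empty -m "I"
git -C "$tmp/upstream" checkout -q master

git clone -q "$tmp/upstream" "$tmp/clone"
cd "$tmp/clone"
git branch --list develop     # prints nothing: no local develop yet
git switch -q develop         # created from origin/develop by the guess
git status -sb
```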

    In your case, you have something more complicated. For instance, maybe you have:

    ...--G--H   <-- master (HEAD), origin/master
          \
           I   <-- origin/my_feature, fork/my_feature
    

    If you now run:

    git switch my_feature
    

    your Git will complain, because there are two names it could have used to create my_feature.

    Note that in this case, either name would have worked fine, as far as what I'm showing anyway. But git switch, or the older git checkout, won't do what you want here, simply because there is more than one matching name.9 So an explicit git switch -c name --track remote-tracking-name (or, in Git predating 2.23, git checkout -b name --track remote-tracking-name) is usually the way to go.
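
The ambiguous case can be reproduced as follows. This is a sketch that assumes default configuration (in particular, no checkout.defaultRemote setting) and uses hypothetical local repositories for origin and fork, both of which have my_feature.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" commit -q --allow-empty -m "H"
git -C "$tmp/upstream" branch my_feature
git clone -q --bare "$tmp/upstream" "$tmp/fork.git"   # fork has my_feature too

git clone -q "$tmp/upstream" "$tmp/project"
cd "$tmp/project"
git remote add fork "$tmp/fork.git"
git fetch -q fork

# Two candidates (origin/my_feature and fork/my_feature): the guess is refused.
if git switch my_feature 2>/dev/null; then guess=worked; else guess=refused; fi
echo "guessing: $guess"

# Naming the remote-tracking name explicitly removes the ambiguity.
git switch -q -c my_feature --track fork/my_feature
```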


    9There is a configuration setting you can change to adjust this, but this answer is already very long, so I won't cover that here either.


    Further reading

    The verb track in Git is badly overloaded. The remote-tracking names already use this verb, in that when you run git fetch, your Git updates them from their Git's branch names. The git switch -c name --track remote/name uses the verb in another way, which we have not covered here.
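
For the curious, what --track records can be inspected. In this sketch (hypothetical local upstream), the new branch gets an upstream setting stored in the repository's configuration, which is what commands such as git status consult when reporting ahead and behind counts.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" commit -q --allow-empty -m "initial"
git -C "$tmp/upstream" branch my_feature
git clone -q "$tmp/upstream" "$tmp/clone"
cd "$tmp/clone"

git switch -q -c my_feature --track origin/my_feature

# The upstream is stored as two configuration entries for the branch.
git config branch.my_feature.remote                  # origin
git config branch.my_feature.merge                   # refs/heads/my_feature
git rev-parse --abbrev-ref 'my_feature@{upstream}'   # origin/my_feature
```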

    Independent of both of these, files in your work-tree can be either tracked or untracked. A tracked file is simply one that is in Git's index right now. We have not covered Git's index properly either, but it's a very important construct and it's good to know about it.

    You can extract any file, or an entire directory-tree, from any commit. In Git 2.23 or later, use git restore to do this. In older versions of Git, this function is jammed into git checkout too.
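
A small sketch of extracting one file from an earlier commit with git restore (Git 2.23 or later); the file name notes.txt is made up. Only the work-tree copy changes, and the current commit is left alone.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo "first version" > notes.txt
git add notes.txt
git commit -q -m "add notes"
echo "second version" > notes.txt
git commit -q -am "update notes"

# Copy notes.txt as it was one commit ago into the work-tree.
git restore --source=HEAD~1 -- notes.txt
cat notes.txt      # first version
git status -s      # notes.txt shows as modified relative to HEAD
```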

    Conclusions

    The things to keep in mind here are:

    • Git is distributed. There are many copies of some repositories, holding commits. The commits hold files, but the unit at which we interact with Git is whole commits: we either have them, or don't.

    • These repository copies mostly contain the same commits. The commits are literally shared, by copying, and are found by their hash IDs. No commit can ever change, not one bit, once it's created, so it's safe to just use your copy, if you have a copy. If you don't have a copy, you just get one, by its hash ID, from anyone who has one: they're all the same. Not only that, but the cryptography aspect of the hash-ID-as-checksum tells you that you have the right commit content: that no one has messed with it.

    • These repositories do not share their branch names. At most, sometimes someone comes along and makes sure that repository copy C1's branch name B matches that in repository copy C2. If people are diligent about this, the branch names seem to stay in sync, but that's just because people were diligent about synchronizing these independent names.

    • Your Git will remember the URL for some other Git under a remote: a short name like origin or fork or upstream or whatever you like.

    • Your own Git will remember another Git's branch names as your remote-tracking names. Running git fetch to that other Git, by its remote name, will pick up new commits (but not re-obtain old ones, so it goes fast) and update your Git's memory of their Git's branch names.

    • To do new work, you'll want to create or update your own branch names.

    • Do remember that until you use git push to send new commits to other Git repositories, they won't have your own commits. Only your own Git repository will have these. If you ask those other Git repositories for those hash IDs, they'll just say: I don't have that hash ID.
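
The last point can be sketched with a hypothetical bare repository playing the role of the other Git: a new local commit is unknown there until it is pushed. The helper remote_has is a made-up name for this illustration.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
git init -q --bare "$tmp/remote.git"
git init -q "$tmp/local"
cd "$tmp/local"
git symbolic-ref HEAD refs/heads/master
git remote add origin "$tmp/remote.git"
git commit -q --allow-empty -m "local only"
new=$(git rev-parse HEAD)

# Ask the other repository whether it has an object with this hash ID.
remote_has() { git -C "$tmp/remote.git" cat-file -e "$1" 2>/dev/null; }

if remote_has "$new"; then before=present; else before=missing; fi
git push -q origin master
if remote_has "$new"; then after=present; else after=missing; fi
echo "before push: $before, after push: $after"
```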