git apache github branch commit

Why do some commits belong to no branch?


I have noticed that some commits in Git repositories do not belong to any branch. For example, the following commit is tagged as a release of Apache Commons CSV, but it does not belong to any branch:

https://github.com/apache/commons-csv/commit/0fbd1af5e3bd70454d5e398493a5c983aead2b67

Its parent commit belongs to master.

https://github.com/apache/commons-csv/commit/7688fbc3f9f5acf73d3c5018dd83310f7580d02e

Is it possible for you to help me understand this?


Solution

  • This situation is normal enough in Git, which uses branches in a very different way from most traditional version control systems (VCSes). There is in fact a fairly deep philosophical question hidden here: see What exactly do we mean by "branch"?

    Branch names identify tip commits

    In most VCSes, the name of a branch is important, perhaps even the most important thing about the branch. This is not true in Git: branch names, in Git, have very little value (to Git itself anyway). To Git, what matters are commits. Commits are permanent (well, mostly permanent) and immutable: once made, no commit can ever be changed. But the true name of each commit is a terrible, unwieldy, unpronounceable, impossible-to-remember string of digits and letters, such as fe0a9eaf31dd0c349ae4308498c33a5c3794b293. These are not good for human beings, so Git lets us use names to stand in for these raw hash IDs.

    Another important thing about each commit, though, is that any one commit stores the true name (the hash ID) of another commit, which we call the commit's parent or predecessor. We say that this child commit points to its parent.[1] If we take a string of unpronounceable hash IDs and put them in the "most grandparent-y" to "most child-y" order, we get something like:

    ... <-26e4... <-8b02... <-fe0a...
    

    The most-child-like of these commits gets the branch name, and the name then points to the last commit:

    ... <-26e4... <-8b02... <-fe0a...   <--master
    

    Git uses that last (or tip) commit to find its parent, and then uses the parent to find the grandparent, and so on, throughout all of the repository. But because the hash IDs look random—and are deliberately almost impossible to predict—even Git itself wants to have a name by which it can find the last commit in the chain. That hash ID is especially important since Git uses that commit to find the rest of the commits. This gives us a picture like this one:

              o--o   <-- branch1
             /
    ...--o--o
             \
              o--o--o   <-- branch2
    

    (where I've simply stopped drawing the internal backwards direction of the arrows, and replaced the hash IDs with round dots for each commit).
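
    As a rough illustration of how a name leads to a whole chain of commits, here is a minimal Python sketch. It is a toy model, not how Git stores anything: the short IDs are abbreviations of the made-up ones above, 26e4 is assumed to be the very first commit, and the names parents, branches, and history are hypothetical.

```python
# Toy model: each commit ID maps to the ID of its parent (None for the root).
# Assumption for illustration: "26e4" is the first commit in the repository.
parents = {
    "26e4": None,
    "8b02": "26e4",
    "fe0a": "8b02",
}

# A branch name stores nothing but the ID of the tip commit.
branches = {"master": "fe0a"}


def history(tip):
    """Follow parent links from the tip commit back to the root."""
    chain = []
    current = tip
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain


print(history(branches["master"]))  # ['fe0a', '8b02', '26e4']
```

    Nothing in the commits themselves mentions master; the name is only the entry point into the chain.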

    The commits in the middle row are a bit puzzling, though: which branch are they on? Git's answer is that they are on both branches. Instead of a commit belonging to the branch on which the commit is first made, a Git commit belongs to every branch—well, every branch name—that leads back to it.
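
    To make "belongs to every branch name that leads back to it" concrete, the sketch below models the drawing above with single-letter commit IDs (an assumption for readability) and single-parent commits only, so merges are ignored. The helper names are hypothetical. In real Git, git branch --contains <commit> answers the same question.

```python
# Toy graph matching the drawing: A--B is the shared middle row.
parents = {
    "A": None, "B": "A",
    "C": "B", "D": "C",            # upper row, ends at branch1
    "E": "B", "F": "E", "G": "F",  # lower row, ends at branch2
}
branches = {"branch1": "D", "branch2": "G"}


def reachable(tip):
    """Return every commit found by walking parent links from the tip."""
    seen = set()
    current = tip
    while current is not None:
        seen.add(current)
        current = parents[current]
    return seen


def branches_containing(commit):
    """A commit is 'on' each branch whose tip can reach it."""
    return sorted(
        name for name, tip in branches.items() if commit in reachable(tip)
    )


print(branches_containing("B"))  # ['branch1', 'branch2']
print(branches_containing("F"))  # ['branch2']
```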

    To add a new commit to some branch, you git checkout the branch, work as usual, git add as appropriate, and run git commit. This writes out a new commit that points back to the current commit as its parent:

                   o   (new!)
                  /
              o--o   <-- branch1 (HEAD)
             /
    ...--o--o
             \
              o--o--o   <-- branch2
    

    Then, whatever commit hash ID gets assigned to the new commit, Git writes that hash ID into the branch name. To know which name to update, Git attaches your HEAD to one of the branch names. Once the new commit's hash is safely stored, we can draw the updated picture as:

              o--o--o   <-- branch1 (HEAD)
             /
    ...--o--o
             \
              o--o--o   <-- branch2
    

    and this is one of the normal ways that branches grow.
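
    The commit step described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the commit function is hypothetical, and the ID is derived from the parent ID and the message only, whereas real Git hashes the entire commit object.

```python
import hashlib

parents = {"A": None, "B": "A"}
branches = {"branch1": "B", "branch2": "B"}
head = "branch1"  # HEAD is attached to a branch name, not directly to a commit


def commit(message):
    """Create a commit whose parent is the current tip, then move the branch."""
    parent = branches[head]
    # Stand-in for a real hash ID (assumption: real Git hashes far more data).
    new_id = hashlib.sha1(f"{parent}:{message}".encode()).hexdigest()[:7]
    parents[new_id] = parent   # the new commit points back to its parent
    branches[head] = new_id    # the attached branch name now points to it
    return new_id


new_id = commit("add feature")
print(branches["branch1"] == new_id)  # True
print(branches["branch2"])            # B
```

    Only the name that HEAD is attached to moves; branch2 still points to B.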


    [1] The child remembers the parent, rather than the other way around. Since commits are immutable, this is necessary. Just as with human parents and children, the parent exists when the child gets created, but the child does not exist yet when the parent gets created. Since commits can only remember the past, the parents cannot recall their children.


    Tags also identify commits

    A tag name, like a branch name, simply points directly to a commit. Unlike a branch name, though, Git won't automatically change a tag name to make it point to any other commit. In fact, you shouldn't do this either, in general—not that it will break your own Git, but it can break other people's expectations about your Git repository. Once they have a tag-name-to-hash-ID mapping, they may think that they have the right hash ID from then on, because tags are not intended to move like branch names. Hence if we tag some commit:

              o--o--o   <-- branch1
             /
    ...--o--o
             \
              o--o--o   <-- branch2 (HEAD)
                    ^
                    |
                 tag:v1.2
    

    and then add another commit:

              o--o--o   <-- branch1
             /
    ...--o--o
             \
              o--o--o--o   <-- branch2 (HEAD)
                    ^
                    |
                 tag:v1.2
    

    the tag remains in place.
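
    The difference between the two kinds of name can be shown with the same sort of toy model. The refs/heads/ and refs/tags/ prefixes are the namespaces Git really uses for branches and tags; the single-letter commit IDs and the commit helper are assumptions made for this sketch.

```python
parents = {"A": None, "B": "A", "C": "B"}
refs = {
    "refs/heads/branch2": "C",  # branch name: moves on every new commit
    "refs/tags/v1.2": "C",      # tag name: stays where it was put
}
head = "refs/heads/branch2"


def commit(new_id):
    """Add a commit on top of the current tip and advance only HEAD's branch."""
    parents[new_id] = refs[head]
    refs[head] = new_id


commit("D")
print(refs["refs/heads/branch2"])  # D
print(refs["refs/tags/v1.2"])      # C
```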

    Names can be deleted at any time

    If we decide that branch2 is a bad idea, we can git checkout branch1 and then delete the name branch2. Without the name branch2, the final commit we just added is no longer findable:

              o--o--o   <-- branch1
             /
    ...--o--o
             \
              o--o--o--o   ???
                    ^
                    |
                 tag:v1.2
    

    The tag name v1.2, however, is still around, and it makes the tagged commit findable. That tagged commit is on no branches (and in this drawing, neither are its parent or grandparent, though its great-grandparent is still on branch1).
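
    The situation in the question can be reproduced in a small Python sketch of this last drawing. The commit letters and helper names are hypothetical, and the model assumes single-parent commits only. In real Git, git branch --contains <commit> and git tag --contains <commit> report the corresponding lists.

```python
parents = {
    "A": None, "B": "A",
    "C": "B", "D": "C", "E": "D",            # upper row, branch1
    "F": "B", "G": "F", "H": "G", "I": "H",  # lower row, formerly branch2
}
branches = {"branch1": "E", "branch2": "I"}
tags = {"v1.2": "H"}


def ancestors(tip):
    """Return the tip plus everything reachable through parent links."""
    seen = set()
    current = tip
    while current is not None:
        seen.add(current)
        current = parents[current]
    return seen


def names_containing(names, commit):
    """List the names (branches or tags) whose commit can reach `commit`."""
    return sorted(n for n, tip in names.items() if commit in ancestors(tip))


del branches["branch2"]  # delete the name; the commits themselves are untouched

print(names_containing(branches, "H"))  # []
print(names_containing(tags, "H"))      # ['v1.2']
print(names_containing(branches, "B"))  # ['branch1']
print(names_containing(tags, "I"))      # []
```

    Commit H plays the role of the tagged release commit: no branch reaches it, but the tag does. Commit I is reachable from no name at all.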

    Names protect commits

    I mentioned above that commits are mostly permanent. That last commit, which no longer has a name, is now unprotected. Git has a device called the garbage collector that acts as a sort of Grim Reaper to remove leftover, unwanted stuff. This Grim Collector, git gc, searches the entire Git database for all commits, while also using all names to find all commits. Commits that can be found via some name—any name, including a tag name—are marked to be kept. Commits (and other Git objects) that are not findable this way, that are unreachable from named commits, get collected and destroyed.

    This process lets Git generate objects freely, and only decide to use them for real at the last minute. It lets you move branch names around at any time as well. As long as commits are protected by a name, they stick around. Once there is no name for them, they become available for garbage-collection. This is how you (and Git) get rid of unwanted commits. Commands like git stash work by creating commits that are on no branch, but are protected by the refs/stash name (or its reflog, which I won't go into here). Dropping a stash drops its name; eventually git gc removes it for real.

    The tag protects the tagged commit, and any earlier (parent) commit, just like a branch name would. If you remove the tag, the now-unnamed commit becomes vulnerable to git gc. But until then, it can happily stick around even though it's on no branch at all.
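
    The keep-or-collect decision can be sketched as a mark-and-sweep pass over a toy commit graph. This is only an illustration of the idea: the collect_garbage function and the commit letters are hypothetical, and real git gc also honors reflogs and a grace period before pruning, which this sketch ignores.

```python
def collect_garbage(parents, refs):
    """Keep commits reachable from any name; drop everything else."""
    keep = set()
    for tip in refs.values():
        current = tip
        # Mark phase: walk parent links until the root or an already-kept commit.
        while current is not None and current not in keep:
            keep.add(current)
            current = parents[current]
    # Sweep phase: only marked commits survive.
    return {c: p for c, p in parents.items() if c in keep}


parents = {
    "A": None, "B": "A",
    "C": "B", "D": "C", "E": "D",            # reachable from branch1
    "F": "B", "G": "F", "H": "G", "I": "H",  # H is tagged, I has no name
}
refs = {"refs/heads/branch1": "E", "refs/tags/v1.2": "H"}

survivors = collect_garbage(parents, refs)
print(sorted(survivors))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']

# Removing the tag leaves F, G, and H unprotected as well.
del refs["refs/tags/v1.2"]
after_untag = collect_garbage(parents, refs)
print(sorted(after_untag))  # ['A', 'B', 'C', 'D', 'E']
```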

    Note that for GitHub-specific-and-internal reasons, GitHub currently never garbage-collects a commit by default, even if Git would have dropped it by now. So if you know the commit's hash ID, and it ever existed in some repository on GitHub, you can still access it on GitHub, in that repository, through its hash ID. If you have a commit that contains a file with sensitive data, you can ask the GitHub operations folks to clear it out manually (although by the time they receive the email the data have probably escaped, since there are "scraper bots" that look for this stuff; hence the advice to change passwords immediately upon finding this sort of problem).