
how to list all 'active' branches in git containing unmerged commits


I'm struggling to make sense of the history of a couple of very large repositories that have hundreds of (old) branches which have never been deleted (even though work on most of these branches is 'done').

I'm trying to find a way to generate a list of branches that

  • contain commits after the branch was created ('not empty')
  • have not been merged into another branch

If my assumption is correct, this should return a list of branches that contain unmerged/active code - everything else is safe to delete.

A nice gimmick would be to visualize this via git log --graph - only displaying the 'current working tree', going back only to the first commit that's present in all of the 'currently active branches'.

Any suggestions/help is highly appreciated!


Solution

  • TL;DR: git branch --no-merged HEAD is probably the answer you want. You may want to add -r or -a, or use something other than HEAD. You might want to run this (adjusted) command numerous times, once for each branch name (although in that case there are some ways to do this more efficiently, at some possible cost).

    Long

    It's important to realize here that Git does not actually merge branches. Or more precisely, we have to define what we mean by branch first (see What exactly do we mean by "branch"?); depending on which definition we use, Git doesn't have branches, or does not merge branches, or does merge branches but then they sometimes un-merge later; or there are other possibilities depending on what you mean by "branch". 😀 What Git does merge—the way that may be useful to your problem, that is—are commits. Branch names help you and Git find commits, which otherwise exist on their own in a commit graph, and that's how you will use the answer above.

    A Git repository is really, primarily, a collection of commits. Git isn't about files—though commits do contain files—and is not about branches, or at least branch names (which are well-defined, unlike "branches"), though the branch names help us find the commits. It's really just about the commits, so you need to be able to visualize commits:

    A nice gimmick would be to visualize this via git log --graph

    You can do just that, but:

    only displaying the 'current working tree', going back only to the first commit that's present in all of the 'currently active branches'.

    The working tree isn't actually in Git, and given how poorly branches are defined in the first place, plus the fact that the word active is entirely un-defined, we'll probably never know what "currently active branches" even means. So we can't possibly do that.

    What is in Git are the commits. The commits:

    • Are numbered: each one has a unique number, or hash ID, represented in hexadecimal. Once some hash ID is allocated to some particular commit, it means that commit, forever, in every Git repository. In other words, these commit hash IDs are universally unique.1 Git gets a lot done with this principle: for instance, we hook two Git repositories up to each other, with git fetch or git push, and they exchange just the raw hash IDs and immediately know which commits (and hence files) the other Git needs to get.

    • Are immutable: no part of any commit can ever change. (This is true of all of Git's internal objects, all of which use the same hash-ID scheme. The hashing only works as long as the objects cannot change.)

    • Store two things: a snapshot of all files (in a special internal read-only de-duplicated format), and some metadata. The metadata include things like who made the commit and when, but also, crucial for Git's internal workings, a list of hash IDs of previous, or parent, commits.

    Usually the list of parents in each commit is just one element long, which gives us a simple linear backwards-looking chain of commits:

    ... <-F <-G <-H
    

    Here H stands in for the actual hash ID of the last commit in the chain. Commit H stores both a snapshot of all files (as of the state they had at the time someone made H), and some metadata. The metadata in H hold the hash ID of H's parent commit G, which stores a snapshot and some metadata; the metadata for G store the hash ID of F, which stores a snapshot and metadata; and so on, forever—or at least, until we get back to the very first commit ever, which cannot have a parent, so just doesn't:

    A--B--C--D--E--F--G--H   <-- latest
    

    We say that commit H points backwards to G, which points backwards to F, and so on. Commit A, being the first commit, doesn't point anywhere, so that allows git log to stop.

    To find H, though, we must tell Git its hash ID. To avoid having to memorize hash IDs ourselves, we have Git save this hash ID in a name, such as a branch name, latest. That name then points to H, which lets us get started.
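
    As a rough illustration of the backwards-pointing chain, the sketch below builds three empty commits in a scratch repository and walks back from the name. The temporary directory, the demo identity, and the commit messages F, G, H are assumptions made up for the example; only the branch name latest comes from the diagram above.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/latest        # name the first branch "latest" on any Git version
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
for msg in F G H; do git commit -q --allow-empty -m "$msg"; done

tip=$(git rev-parse latest)                    # hash ID of H, found through the name
parent=$(git rev-parse latest^)                # H's parent, G
root=$(git rev-list --max-parents=0 latest)    # the commit with no parent (F here)

git cat-file -p "$tip"                         # raw commit: tree, parent, author, committer, message
git log --oneline latest                       # walks H, G, F and stops at the root
```

    The cat-file output shows a single parent line, which is the link that git log follows.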


    1We can prove, via the pigeonhole principle, that this can't actually work. Eventually it will fail. The size of the hash ID determines how soon failure becomes a distinct probability; by making it big enough, we push the failure far enough into the future that we don't care, because in Keynes' long run, we're all dead. 😀


    Now we can see how branch names work

    Suppose we have a series of commits ending at H, plus a single branch name like main:

    ...--G--H   <-- main
    

    We now add a second name, also pointing to H, so that all the commits are now on two branches:

    ...--G--H   <-- dev, main
    

    We need a way to pick out which name we're actually using. To do that, we'll have Git attach the special name HEAD to one of the branch names:

    ...--G--H   <-- dev, main (HEAD)
    

    This means we're "on" main, having done a git checkout main or git switch main, or having started out on main. Meanwhile we're using commit H. If we'd like to use the name dev instead, we run:

    git switch dev
    

    and get:

    ...--G--H   <-- dev (HEAD), main
    

    We're still using commit H, but we're using it through the name dev now.
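
    A small sketch of the same thing in a scratch repository (the directory and demo identity are assumptions for the example): two names select one commit, and HEAD records which name is in use.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main          # call the first branch "main" on any Git version
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m H

git branch dev                                 # a second name for the same commit H
git symbolic-ref --short HEAD                  # prints "main": HEAD is attached to main
git checkout -q dev                            # `git switch dev` does the same on Git 2.23+
current=$(git symbolic-ref --short HEAD)       # now "dev"
git log --oneline --decorate -1                # one commit, decorated with both names
```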

    A brief aside on Git's index / staging-area and your working tree

    All the files in any Git commit snapshot are immutable. But we want to be able to mutate files: we can't get any actual new work done if we can't change the files. Git solves this problem like most version control systems: when we check out some commit, Git copies the files out of the commit into a work area. This work area is our working tree or work-tree.

    It's important to realize that these files are not in Git. They came out of Git, but inside Git, they are in a special, read-only, compressed (sometimes highly compressed) and de-duplicated form, that only Git itself can read and literally nothing can write. So Git copies them out, and the copies are not in Git. The copies are instead ordinary everyday files, that every program can read and write in the usual way.

    When programs do this, Git does not know that they are doing this.2 That's part of why you have to tell Git—with git add—that some file is updated.

    Other version control systems have, historically, just scanned for changes. That is, you run their equivalent of checkout and they check out some commit or file. Then you run their equivalent of checkin / commit, and they scan everything, and you go out to lunch because this step will take at least 5 minutes and perhaps an hour or more. Git doesn't do this: instead, Git keeps an extra copy of every file, but in the compressed-and-de-duplicated form. Since these extra copies just came out of a commit, they are by definition duplicates, and therefore take no space.3 This makes up most of what Git calls its index or staging area.

    When you run git add on some file, you're really telling Git: Read the working tree copy, and compress it into the internal de-duplicated form. If that turns out to be a duplicate, de-duplicate it now, so that it's prepared for the next commit. Otherwise prepare it for the next commit now. Either way, after git add, the index / staging-area copy now matches the working-tree copy, and is "staged for commit". If it matches the already-committed copy, Git doesn't say anything about it when you run git status. If not, git status says staged for commit. But in fact every file in Git's index is staged for commit: that's why this is the staging area. If Git said updated in proposed next commit, that might be better, but instead Git just says staged for commit.
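
    To watch the index copy change, one can compare blob hash IDs before and after git add. This is only a sketch in a throwaway repository; file.txt, its contents, and the demo identity are invented for the example.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
echo "v1" > file.txt
git add file.txt && git commit -q -m "add file"

committed=$(git rev-parse HEAD:file.txt)       # blob ID stored in the commit
echo "v2" > file.txt                           # edit the working-tree copy; Git is not told
indexed_before=$(git rev-parse :file.txt)      # index still holds the committed blob
git add file.txt                               # store the new content and update the index
indexed_after=$(git rev-parse :file.txt)       # index now holds a different blob

git ls-files -s                                # mode, blob ID, stage number, path
git status --short                             # "M  file.txt": index differs from HEAD
```

    Before git add the index entry equals the committed blob, so nothing is reported as staged; afterwards it matches the working-tree file instead.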


    2For efficiency, it's sometimes nice to use an OS's file-monitoring facilities, and Git has some primitive ability to do this on some OSes. But for the most part Git still isn't aware of this. Git has a different efficiency trick up its sleeves (if Git can be said to have sleeves).

    3These index entries still take space to record their names and a bunch of related data, on the rough order of about 100 bytes per file.


    Making new commits

    Let's say we are in this state:

    ...--G--H   <-- dev (HEAD), main
    

    That is, we're on branch dev and using commit H. Meanwhile we've updated some files and run git add on them, so that the staged-for-commit copy doesn't match the copy in commit H. We now run git commit, and Git executes the following steps, in some order:

    • Git collects any extra metadata it needs, such as our name and email address and the current date-and-time, and a log message.
    • Git resolves the current commit to a raw hash ID (that of H) to put in as the list of parent commits.
    • Git freezes for all time the snapshot as it appears in the index.
    • Git combines all of these into a new commit, which gets a new unique hash ID; we'll call that I. Note that new commit I points back to existing commit H.
    • Here's the tricky part: Git writes the new commit's hash ID into the current branch name.

    So now we have:

              I   <-- dev (HEAD)
             /
    ...--G--H   <-- main
    

    Note that git branch did not create a branch; git commit created the branch. At least, that's what happened as long as "the branch" means the fact that commit I, now exclusively on dev, "branches off" from main.
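
    The steps listed above can be approximated with Git's lower-level commands. This is a sketch of the idea, not what git commit literally runs internally (it skips hooks, the editor, and more); the scratch repository, file name, and demo identity are assumptions.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m H
git checkout -q -b dev                         # ...--H   <-- dev (HEAD), main

echo work > file.txt && git add file.txt       # stage a change in the index
tree=$(git write-tree)                         # freeze the index as a snapshot
new=$(git commit-tree "$tree" -p HEAD -m I)    # snapshot + metadata + parent H = commit I
git update-ref HEAD "$new"                     # write I's hash ID into the current branch name

git log --oneline --graph --all --decorate     # dev is at I, main still at H
```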

    As we make more commits, they add on to I:

              I--J   <-- dev (HEAD)
             /
    ...--G--H   <-- main
    

    until we git switch back to main:

              I--J   <-- dev
             /
    ...--G--H   <-- main (HEAD)
    

    When we do switch commits, Git removes, from the working tree (and its index / staging-area), the files from commit J, and puts in the files from commit H instead. There's a bunch more trickiness here, but we'll ignore that.

    If we create a third name and switch to that, and add two more commits, we get this situation:

              I--J   <-- dev
             /
    ...--G--H   <-- main
             \
              K--L   <-- feature (HEAD)
    

    It's important to realize two things here:

    1. Commits up through and including H are on all branches.
    2. The name main is no longer needed in some sense: its purpose is to locate commit H. It still serves this purpose, but so do commits J and L. By starting at dev (J) and working backwards, we will reach—and hence find—commit H. The same holds for commit L. However, we do need the names dev and feature because those names are the only ways to find commits I-J and K-L respectively.4

    4If you goof this up—which is easy to do in Git—Git provides numerous ways to find the commits again, for a while. Eventually those "recover from mistake" entries, called reflogs, will expire. In what is probably a mistake, that has not been corrected in 15+ years, deleting a branch name deletes the branch's reflog, so one should be at least somewhat cautious about branch-name deletion. If Git kept these reflogs, and there's work going on that might lead to this, you could "un-delete" a branch name.


    True merges

    Once we have a branch-y structure of commits—a commit graph with a branch in it—like this one:

              I--J   <-- br1 (HEAD)
             /
    ...--G--H
             \
              K--L   <-- br2
    

    we often find it interesting and useful to use git merge. What git merge does with these, expressed as a high-level goal, is to combine work. "Work", in this case, is defined in terms of changes. Git doesn't store changes: Git stores commits. So to get changes, Git has to compare commits.

    We already see this every day with git show or git log -p. When we use these commands, Git finds a commit and uses that commit's metadata to find the commit's parent commit:

    ...--o--o--P--C--o--...
    

    To "show" commit C, Git finds its parent P, extracts both snapshots, and compares them. For every file that is the same, Git says nothing, and for every file that is different, Git figures out a recipe that will change the copy of that file in P to match the copy in C and produces that recipe.

    If work is changes, and if we have:

              I--J   <-- br1 (HEAD)
             /
    ...--G--H
             \
              K--L   <-- br2
    

    then it's intuitively obvious5 that if we compare the snapshot in H to that in J, we'll find out what work happened on br1. If we compare the snapshot in H to that in L, we'll find out what work happened on br2. Moreover, this produces two change recipes, as it were, that if applied to H, produce the snapshots in J and L respectively. If we combine the two recipes, we'll combine the work.

    That is, suppose one recipe says to modify some file, and the other doesn't mention the file at all. The combination is to take the change. If both recipes say to change a shared file, we simply combine both changes: as long as they're to different regions of the file, we can probably do that. We'll skip right over the entire mechanism here and just assume that Git can combine changes and do so correctly.6 Git applies the combined changes to the common-starting-point snapshot, from H, and makes a new merge commit M:

              I--J
             /    \
    ...--G--H      M   <-- br1 (HEAD)
             \    /
              K--L   <-- br2
    

    Commit M has a snapshot as usual: the snapshot is that built by applying the combined changes to the snapshot from H. Commit M has metadata as usual: you are the author-and-committer, its date-and-time is "now", and its default log message is the rather useless7 Merge branch 'br2' into br1. The only thing that is different and special about M is that instead of one parent J, it has two: J and L. So when git log goes looking at what commits are "on" branch br1, Git will follow both links, and commits L and K will be on the branch now, even though they were not, a moment ago.

    If we don't ever need to find commits K-L quickly any more, we can now delete the name br2:

              I--J
             /    \
    ...--G--H      M   <-- br1 (HEAD)
             \    /
              K--L
    

    We can still find commit L by stepping back to the second parent of M, and from there we can find K. So we might delete the name br2. If we don't, we get the problem you wrote the post about in the first place.
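
    Here is a minimal sketch of a true merge in a scratch repository. To keep it short, each side gets one commit (J and L) instead of two, and the file names and demo identity are made up.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m H
git checkout -q -b br1 && echo one > a.txt && git add a.txt && git commit -q -m J
git checkout -q -b br2 main && echo two > b.txt && git add b.txt && git commit -q -m L
git checkout -q br1

git merge -q --no-edit br2                             # true merge: creates M
parents=$(git rev-list --parents -n 1 HEAD | wc -w)    # M's own ID plus its parent IDs
git branch -d br2                                      # allowed: L is reachable from br1
git log --oneline --graph                              # L still appears, via M's second parent
git rev-parse HEAD^2                                   # hash ID of L
```

    The word count is 3 because the line holds M's hash ID followed by its two parents.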


    5Mathematicians use this phrase to mean I don't want to prove it, and if I put it this way, you'll be too embarrassed to ask me to do that. 😀

    6As dumb as Git is—it has no knowledge of the contents of the files; it just applies simple line-by-line text rules here—this actually works surprisingly often. But this is less true for XML or JSON data; don't let Git combine XML or other structured text without careful inspection or testing, or both.

    7This is not always completely useless, but any auto-generated text is rarely going to be as good as something someone actually thinks about. Most people don't normally write good merge messages, though; you can derive useful data by looking at the two parent chains.


    Things that are not merges

    Suppose that instead of the above branch-y diagram, we have the rather simpler:

              I--J   <-- dev
             /
    ...--G--H   <-- main (HEAD)
    

    Suppose we now run git merge dev to combine work done on main vs work done on dev. The "work" we did on main will be: whatever is in commit H as compared to the files in commit H. But the files in commit H will, by definition, match the files in commit H. So there's no work done on main that isn't already on main. To that, we want to add the work done on dev, which is what we'll see as a recipe if we diff H vs J.

    Git could do this as a regular merge:

              I--J   <-- dev
             /    \
    ...--G--H------M   <-- main (HEAD)
    

    but if Git did this with the standard merge code, the snapshot in M would exactly match the snapshot in J. Commit M is in some sense not required. We do need it if we want to know that some feature was merged, but we don't need it if we just want to keep track of the commits and all the work.

    By default, Git doesn't bother doing a full merge here. Instead, git merge dev just does a git checkout or git switch to commit J, while dragging the branch name forward, like this:

              I--J   <-- dev, main (HEAD)
             /
    ...--G--H
    

    and then there is no reason not to just draw everything on one line:

    ...--G--H--I--J   <-- dev, main (HEAD)
    

    We can now safely delete the name dev, as before, leaving no trace of the merge action. If we don't, though, and make more commits on main or otherwise advance the name main, we get:

    ...--G--H--I--J   <-- dev
                   \
                    K   <-- main (HEAD)
    

    just as br2 will linger behind br1 after a true merge.
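
    Git calls this name-dragging a fast-forward. The sketch below (scratch repository, empty commits, demo identity all assumed) shows that no new commit is created, and that dev is left behind once main moves on.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m H
git checkout -q -b dev
git commit -q --allow-empty -m I && git commit -q --allow-empty -m J
git checkout -q main

before=$(git rev-list --count --all)           # H, I, J
git merge -q dev                               # fast-forward: main just moves to J
after=$(git rev-list --count --all)            # still three commits, no merge commit
git commit -q --allow-empty -m K               # main advances; dev lingers at J
git branch --merged main                       # lists dev and main
git log --oneline --graph --all --decorate
```

    If a record of the merge is wanted, git merge --no-ff dev forces a real merge commit instead.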

    Now we can understand git branch --merged and git branch --no-merged

    These commands need one input: a commit. We pick some commit, like J or K or H or whatever. Git then looks at all branch names, or with -r, all remote-tracking names (which I'll cover in a moment). For each such name:

    • the name selects some commit;
    • is that commit "ahead of" or "behind" the commit we picked?

    Note that it can be both, as is the case with:

              I--J   <-- br1 (HEAD)
             /
    ...--G--H
             \
              K--L   <-- br2
    

    Here, commit L, found via name br2, is behind br1 or commit J because commits I and J are only on br1. But it's also ahead of br1 because commits K and L are only on br2. With:

    ...--o--P--C--o--...
    

    commit P is one step behind C and C is one step ahead of P, and there are no complications, but when there is a "branch-y" graph structure, there are these complications.
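
    One way to put numbers on "ahead" and "behind" is git rev-list with the three-dot syntax. The sketch rebuilds the branch-y graph with empty commits in a scratch repository; the extra name main at H and the demo identity are assumptions.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m H
git checkout -q -b br1 && git commit -q --allow-empty -m I && git commit -q --allow-empty -m J
git checkout -q -b br2 main && git commit -q --allow-empty -m K && git commit -q --allow-empty -m L

# Two tab-separated counts: commits only on br1, then commits only on br2
counts=$(git rev-list --left-right --count br1...br2)
echo "$counts"
ahead=$(git rev-list --count main..br2)        # commits br2 has that main lacks
behind=$(git rev-list --count br2..main)       # commits main has that br2 lacks
echo "br2 vs main: ahead $ahead, behind $behind"
```

    Here br2 is both two ahead of and two behind br1, while against main it is two ahead and zero behind.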

    What --no-merged does is look for any names that find any commits that are "ahead of" the selected commit. So if we select commit H, then git branch --no-merged will show us both names br1 and br2, as both names are ahead of H. But if we select commit J, git branch --no-merged will show us only the name br2, because br1 selects J, which is not ahead of J.

    What --merged does is similar, except that it shows us any names where the name selects a commit that is not ahead of the one we pick. Let's use this diagram yet again, but add the name main pointing to H, and switch to main:

              I--J   <-- br1
             /
    ...--G--H   <-- main (HEAD)
             \
              K--L   <-- br2
    

    The git branch --merged command will, if we pick main / HEAD as the commit, show us only main, because br1 and br2 are both ahead of commit H. Note that --merged counts an "even" branch as merged, and since main selects H, git branch --merged main prints main.

    If we pick commit J, though, it will show us the names main and br1, because both of those names pick a commit that is not ahead of commit J. Or, if we pick commit L, it will show us names main and br2.
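
    The cases above can be checked against a scratch copy of the diagram. This sketch uses empty commits and a demo identity (both assumptions), and --format (Git 2.13 or later) to print bare names without the leading "* " marker.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m H
git checkout -q -b br1 && git commit -q --allow-empty -m I && git commit -q --allow-empty -m J
git checkout -q -b br2 main && git commit -q --allow-empty -m K && git commit -q --allow-empty -m L
git checkout -q main

fmt='%(refname:short)'
no_merged_main=$(git branch --format="$fmt" --no-merged main)   # br1 and br2
no_merged_br1=$(git branch --format="$fmt" --no-merged br1)     # br2 only
merged_main=$(git branch --format="$fmt" --merged main)         # main only
merged_br1=$(git branch --format="$fmt" --merged br1)           # br1 and main
merged_br2=$(git branch --format="$fmt" --merged br2)           # br2 and main
printf '%s\n' "--no-merged main:" "$no_merged_main" "--merged br1:" "$merged_br1"
```

    Adding -r or -a to any of these makes git branch consider remote-tracking names as well.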

    Remote-tracking names

    Git is not just a version control system. It's a distributed (and actually more important here, replicated) version control system. We make copies of repositories with git clone. Each repository contains commits, but each repository also contains these branch names that help us find commits.

    When we clone a repository, we copy all of its commits8 and none of its branch names. That is, the names in the repository we copy are private to that particular repository. We can, however, see them while our Git, working on our repository, is hooked up to their Git software that's reading their repository. So our Git takes their <name, hash-ID> pairs and stores them in our repository too, but first it changes the names.

    We give their repository a name. The standard name we use for "the" other repository (when there's only one such) is origin. That is, we run:

    git clone -o origin <url>
    

    and our Git saves the URL under the name origin. If we don't use -o, the default name is origin anyway, so we mostly don't use -o. In any case, this name—which is almost always origin, though you can change it—is something Git calls a remote. It's mainly a short name by which we can refer to their repository, instead of typing out the URL repeatedly.9 I like to refer to this as "their Git": their Git software answers at this URL, which connects their Git software to their repository, or "their Git".

    To build the names it will use to save their branch names, our Git sticks our remote name in front of their Git's branch names: their main becomes our origin/main, for instance, and their dev becomes our origin/dev. So after git clone, we have a repository with all the commits, and with all their branch names changed into these funny origin-prefixed names. These names correspond to their branch names, but they literally are not branch names: if you git checkout origin/dev your Git tells you that it's gone into "detached HEAD" mode.

    Having done all this copying, the last step of git clone is that our Git will create one branch name. We pick the branch name with -b: git clone -b dev url for instance. If we don't pick a name with -b, our Git will ask their Git what they recommend, which is usually master or main, and then our Git creates that name.

    What all this means is that we end up with a repository with all their commits (but see footnote 8) and one branch. Their branches have become our remote-tracking names. Our one branch, that git clone created as its last step, points to the same commit as one of their branch names, and that's the branch we have checked out right now.

    To update our remote-tracking names, we run git fetch:

    git fetch origin
    

    This tells our Git to look up the name origin, convert it to a URL, contact the Git software there, and have them list out their branch names and hash IDs. Our Git can immediately tell, from the hash IDs, whether we have all of their commits, or need to get some commits from them. If we need commits, our Git converses with their Git to make a more complete list, then gets their new commits and stuffs those into our repository: because these are the same commits, they have the same hash IDs. Now we have all their commits, plus any commits we had before that they didn't have.

    Having obtained from them any new commits they have that we need, our Git now updates our remote-tracking names to remember which commits their branch names remember. And then we're done fetching and our Git disconnects from their Git.
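
    The clone-then-fetch sequence can be tried locally with two scratch repositories, one playing "their Git". Everything here (the temporary directories, the empty commits, the demo identity) is an assumption for illustration; a local path stands in for the URL.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
upstream=$(mktemp -d) && cd "$upstream" && git init -q
git symbolic-ref HEAD refs/heads/main
git commit -q --allow-empty -m H
git checkout -q -b dev && git commit -q --allow-empty -m I && git checkout -q main

clone=$(mktemp -d) && git clone -q "$upstream" "$clone" && cd "$clone"
git branch                                     # one local branch: main
git branch -r                                  # their names, renamed to origin/*

git -C "$upstream" commit -q --allow-empty -m K    # "they" add a commit to their main
git fetch -q origin                                # updates origin/main, not our main
git branch -r --no-merged HEAD                     # remote-tracking names ahead of our HEAD
```

    After the fetch, origin/main and origin/dev both select commits that our main does not have, so both show up in the last listing.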

    (If we want to send them commits that we have that they don't, we use git push. This is almost a mirror image of git fetch, with one really huge exception: they don't have any remote-tracking names for us. We ask them to create or set one of their branch names, after we send them new commits. But we'll skip over all of this here.)


    8This is a bit of an overstatement: we copy the reachable commits, and we can deliberately limit how many of those we copy too. But the default is to copy all reachable commits, and people generally don't worry about nominally-removed, still-findable-by-reflog commits, so saying "all commits" is a good way to think of it, as long as you remember that there's a footnote.

    9In primeval Git, you really did have to type out the URL each time. This was pretty error-prone and the Git folks invented a bunch of different hacks to get around it. The one that really stuck, in the end, was this idea of a remote, origin.


    Conclusion

    The git branch command is the user-facing (or porcelain) command that iterates over branch names, or things that look like branch names such as remote-tracking names. It also lets us create and delete branch names, though that's not what we're concerned with here.

    Using --merged or --no-merged, we can pick out one commit in our repository, and ask which names—branch and/or remote-tracking names—in our repository point to specific commits that are either not ahead of (--merged) or are ahead of (--no-merged) the one commit we picked out. Because of the nature of the commit graph and the way branch names work, that usually gets us what we want here.
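
    For the "not merged into any other branch" reading of the question, one possible approach is to ask, for each branch name, whether any other name's history contains its tip commit. The loop below is a sketch of that idea using git for-each-ref --contains (Git 2.7 or later); the scratch repository, the branch names, and the demo identity are all invented. It only looks at local branch names, and two names that select the same commit will each count the other as "containing" it.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m H
git branch stale                                   # never got commits of its own
git checkout -q -b br1 && git commit -q --allow-empty -m I
git checkout -q -b br2 main && git commit -q --allow-empty -m K
git checkout -q main && git merge -q br1           # br1 is merged (fast-forward)
git commit -q --allow-empty -m L                   # main moves past br1

unmerged=""
for b in $(git for-each-ref --format='%(refname:short)' refs/heads); do
  # local branch names whose history includes b's tip commit
  holders=$(git for-each-ref --format='%(refname:short)' --contains="refs/heads/$b" refs/heads)
  if [ "$holders" = "$b" ]; then                   # only b itself contains its tip
    unmerged="$unmerged $b"
  fi
done
echo "not merged into any other branch:$unmerged"
```

    In this demo the loop reports br2 and main: stale and br1 are reachable from main, so those two names are the candidates for deletion.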

    (Note that a so-called squash merge, which we did not cover above, is not a merge at all, so this does not work if someone has been using squash merge.)
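
    A short sketch of why squash merges defeat this check, again in a scratch repository with a made-up file, branch name, and identity: the squash commit carries the same content but has no parent link to the feature commits.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m H
git checkout -q -b feature && echo work > f.txt && git add f.txt && git commit -q -m I
git checkout -q main

git merge -q --squash feature                  # stages feature's changes, makes no commit
git commit -q -m "squash feature"              # ordinary single-parent commit
parents=$(git rev-list --parents -n 1 HEAD | wc -w)             # own ID plus one parent
still_unmerged=$(git branch --format='%(refname:short)' --no-merged main)
echo "still reported as unmerged: $still_unmerged"
git diff --quiet main feature && echo "same content, different history"
```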