Understanding the output of git status with the short flag


$ git status -s
 M README
MM Rakefile
A  lib/git.rb
M  lib/simplegit.rb
?? LICENSE.txt

There are two columns to the output - the left-hand column indicates the status of the staging area and the right-hand column indicates the status of the working tree. So for example in that output, the README file is modified in the working directory but not yet staged, while the lib/simplegit.rb file is modified and staged. The Rakefile was modified, staged and then modified again, so there are changes to it that are both staged and unstaged.

The above is from Pro Git by Scott Chacon and Ben Straub and published by Apress.

I'm confused about the distinction between the staging area and the working tree. I'll explain what I believe to be true.

"the README file is modified in the working directory but not yet staged": We're not tracking this file. Still, Git understands that it has been modified. From the last snapshot.

"the lib/simplegit.rb file is modified and staged": after modification we've staged the file. All that is left is the commit.

"The Rakefile was modified, staged and then modified again, so there are changes to it that are both staged and unstaged.": Just like the previous file we have staged a modified file. What's next?


Solution

  • "the README file is modified in the working directory but not yet staged": We're not tracking this file. Still, Git understands that it has been modified. From the last snapshot.

    No, this is wrong: the file is tracked. Specifically, a copy of it is in the staging area, and that staged copy matches the HEAD copy; being in the staging area is what makes a file tracked. The tricky part with the staging-area is that it is normally mostly invisible. This leads people down the wrong path in trying to understand how it works.

    First, let's tackle some Git terminology. There are three entities of interest at this point: the current commit, the staging-area (which actually has three names), and the work-tree. The three names for the staging-area are index, staging-area, and cache, and these three names reflect the low quality of Linus Torvalds's original choice ("index") or the enormous importance of the invisible staging-area, or both. (I think both.) Let's look deeper at each one:

    • The current commit, which we can also name via the name HEAD (in all capitals [1]), is of course a commit: it's a snapshot of all the files that were in the staging area when you (or whoever) ran git commit. This snapshot is permanent (mostly) and read-only (entirely). Its true name is not HEAD (that's just a symbolic name by which we can find it right now) but rather some big ugly hash ID. The hash ID appears random, but is in fact a cryptographic checksum of the complete contents of the commit. That's why the commit can't be changed: changing anything would change the checksum, resulting in a different commit.

      The files stored within [2] the commit are also read-only. They are stored in a special, Git-only, compressed form. This particular compression has the nice property that if the contents of a file are the same from one commit to another, these commits all share the underlying compressed file-image. That means you can commit a big file millions of times, if you like, and the file's contents take no more space than they would if you had committed it once.

    • The index / staging-area / cache is this crazy almost-invisible data structure. It holds a copy of every file that will go into the next commit, at all times, in the same way that a commit contains all the files. The files in the index are also in this special compressed Git-only format. The key difference between a file copy in the index / staging-area, and a copy in a commit, is that the index one can be overwritten.

      (The index also caches—hence the name "cache"—information about the work-tree, to make Git go faster. These two facts, that the index holds all the files all ready to go into the next commit, and that it caches stuff about the work-tree, are what make git commit so insanely fast, compared to other similar version control systems.)

    • The work-tree is the simplest of the three, but in a sense, also the one Git cares the least about. It's where you do your work on your files. These files are in the ordinary format that the rest of your computer programs understand. They are the most important to you, but the least important to Git: a --bare repository has no work-tree, but Git can still function (in a more limited way of course).
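To make the "hash ID is a checksum of the contents" point concrete, here is a minimal Python sketch of how Git names the stored form of a file (a blob). It assumes a repository using SHA-1 object names, which is the default; the function name git_blob_id and the sample contents are my own, purely for illustration. Commit hash IDs are computed in the same general way, over the commit's own contents.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git assigns to a file's contents (SHA-1 repositories)."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


big_file = b"pretend this is a very large file\n"

# Same contents in two different commits: same ID, so only one stored object.
id_in_commit_1 = git_blob_id(big_file)
id_in_commit_2 = git_blob_id(big_file)
print(id_in_commit_1 == id_in_commit_2)

# Changing even one byte yields a different ID, hence a different object.
print(git_blob_id(big_file + b"!") == id_in_commit_1)

# The ID Git gives to an empty file.
print(git_blob_id(b""))
```

Because the name is derived only from the contents, an unchanged file has the same name in every commit, which is why those commits can share one stored copy.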

    The work-tree is the only one of these three things that you can see easily and directly. Simply use whatever command it is that lists files or views files: there they are, plain to see. Fortunately, commits are easy to see as well, by checking them out.

    When you initially check out some particular commit (via git checkout master or git checkout develop, for instance), Git populates both your index / staging-area and your work-tree from that commit. It sets HEAD to be a symbolic name for the correct hash ID. That way, the index already has in it all the same files that are in the HEAD commit, and the work-tree has all the same files that are in the index.

    If you modify a file in the work-tree, and then run git add on it, Git copies the work-tree version of that file into the index / staging-area. Now the HEAD commit version and the index version differ, but the index version and the work-tree version agree with each other.

    If you modify a file in the work-tree but don't run git add on it, the HEAD and index versions agree, but the index version disagrees with the work-tree version.

    If you modify a file in the work-tree, then (1) use git add to copy it to the index / staging-area and (2) modify it again, now all three versions of that file differ. This is where you will see an MM status.
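The checkout / edit / add sequence above can be sketched with three plain dictionaries standing in for HEAD, the index, and the work-tree. This is a toy model, not how Git stores anything; the helper names checkout, add, and agreement are made up for the illustration, and "v1", "v2", "v3" stand in for file contents.

```python
def checkout(commit):
    """Toy checkout: index and work-tree both start as copies of the commit."""
    return dict(commit), dict(commit)


def add(index, worktree, path):
    """Toy git add: copy the work-tree version of one file into the index."""
    index[path] = worktree[path]


def agreement(head, index, worktree, path):
    """Return (HEAD matches index, index matches work-tree) for one file."""
    return (head.get(path) == index.get(path),
            index.get(path) == worktree.get(path))


head = {"Rakefile": "v1"}
index, worktree = checkout(head)
history = [agreement(head, index, worktree, "Rakefile")]   # fresh checkout

worktree["Rakefile"] = "v2"                                # edit, no git add
history.append(agreement(head, index, worktree, "Rakefile"))

add(index, worktree, "Rakefile")                           # stage the edit
history.append(agreement(head, index, worktree, "Rakefile"))

worktree["Rakefile"] = "v3"                                # edit again
history.append(agreement(head, index, worktree, "Rakefile"))

for step in history:
    print(step)
```

The last step, where neither pair agrees, is the situation that shows up as MM.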

    What git status is doing is, in effect, running two diffs. The first one compares HEAD to the index. Whatever is different here is "staged for commit". The second diff compares the index to the work-tree. Whatever is different here is "not staged for commit". That's almost it—we're nearly done!
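Here is a rough Python sketch of those two diffs, again using dictionaries (path to contents) as stand-ins for HEAD, the index, and the work-tree. It is only a model of the idea: the function name short_status is hypothetical, and real Git handles many more cases (renames, merge conflicts, ignore rules, and so on). The sample data is my guess at a state that would produce the output quoted in the question.

```python
def short_status(head, index, worktree):
    """Toy model of `git status -s`: one two-column line per changed path."""
    tracked_lines, untracked_lines = [], []
    for path in sorted(set(head) | set(index) | set(worktree)):
        if path in index:
            # Column 1: diff HEAD vs index ("staged for commit").
            if path not in head:
                x = "A"
            elif head[path] != index[path]:
                x = "M"
            else:
                x = " "
            # Column 2: diff index vs work-tree ("not staged for commit").
            if path not in worktree:
                y = "D"
            elif index[path] != worktree[path]:
                y = "M"
            else:
                y = " "
        else:
            # Not in the index: deleted relative to HEAD, or never tracked.
            x = "D" if path in head else " "
            y = " "
            if path in worktree:
                untracked_lines.append(f"?? {path}")
        if x != " " or y != " ":
            tracked_lines.append(f"{x}{y} {path}")
    return tracked_lines + untracked_lines


head = {"README": "v1", "Rakefile": "v1", "lib/simplegit.rb": "v1"}
index = {"README": "v1", "Rakefile": "v2",
         "lib/git.rb": "v1", "lib/simplegit.rb": "v2"}
worktree = {"README": "v2", "Rakefile": "v3", "lib/git.rb": "v1",
            "lib/simplegit.rb": "v2", "LICENSE.txt": "v1"}

for line in short_status(head, index, worktree):
    print(line)
```

Note that a file that is identical in all three places produces no line at all, which is why most tracked files never appear in the output.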

    Last, let's take a look at the term tracked as applied to files. In Git, a file is tracked if and only if it is in the index / staging-area. It's really that simple! The tricky part is telling whether a file is in fact in the index, since it's normally so invisible there.

    The git status command compares the index against two things. First, it compares HEAD vs index. Suppose some file is in both HEAD and index and has the same contents in both. Then you won't see it in the first column. Likewise, you won't see it in the second column if it's the same in the index and the work-tree. So if the file is in the index, but matches both HEAD and work-tree versions, it's invisible.

    Suppose some file isn't in the index. If it's in HEAD, git status will tell you that between HEAD and the index, the file got deleted—a D in the first column of the short output. So in that case you can tell: the file has gone away from the index, and is no longer tracked. It won't be in the next commit.

    Suppose some file isn't in HEAD, but is in the index. In this case git status will tell you that between HEAD and the index, the file got added—an A in the first column of the short output. So in that case you can tell that the file is now tracked, and will be in the next commit.

    The tricky case occurs when a file is both untracked and ignored, because now, if the file is not in the HEAD commit (and by definition it's not in the index—we just said it was untracked), the first column can't tell you anything: it's not in either of those two entities, so Git says nothing here. The second column could tell you that the index and work-tree don't match, if the file exists in the work-tree, but since you told Git that the untracked work-tree file should be ignored, git status won't mention it here either.
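The tracked / untracked / ignored distinction can be written down as a tiny decision function. This is a simplified sketch: the function name classify is hypothetical, and fnmatch-style patterns are only a loose approximation of real .gitignore matching. What it does capture is the order of the questions: "is it in the index?" comes first, and ignore patterns only ever apply to files that are not.

```python
from fnmatch import fnmatchcase


def classify(path, index, worktree, ignore_patterns):
    """Toy classification of one path; the index decides 'tracked'."""
    if path in index:
        return "tracked"          # in the index, whatever the ignore rules say
    if path not in worktree:
        return "absent"           # nothing on disk, nothing in the index
    if any(fnmatchcase(path, pattern) for pattern in ignore_patterns):
        return "ignored"          # untracked, and status stays silent about it
    return "untracked"            # untracked, shown as ?? in short output


index = {"README": "v1", "old.log": "v1"}
worktree = {"README": "v1", "old.log": "v1",
            "build.log": "x", "LICENSE.txt": "x"}
ignore_patterns = ["*.log"]

for path in ["README", "old.log", "build.log", "LICENSE.txt", "missing.txt"]:
    print(path, classify(path, index, worktree, ignore_patterns))
```

In this model old.log stays tracked even though it matches *.log, since it is already in the index.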

    Finally, there are a few things worth mentioning:

    • You can actually view the index. Run git ls-files --stage to see a quick view of most of what is actually in the staging-area. This is impractical in a big project, precisely because the staging-area holds a copy of every file—well, every file that will be committed. That can be tens of thousands of files. It's much more useful to view the difference between the HEAD commit and the index / staging-area, so that's what git status does (in the first column of --short output).

    • You can also view the contents of a commit directly. Run git ls-tree -r HEAD to see all of the committed files. The output is similar to git ls-files --stage. (It adds the Git object type name and takes away the staging number, and uses a tree structure rather than the index's flattened-tree.) As with git ls-files --stage this is mainly useful for debugging Git or writing fancy new commands, not for regular work.
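If you do end up writing a fancy new command, the two listings are easy to pick apart. Below is a small Python sketch that parses one line of each: git ls-files --stage prints mode, hash ID, and stage number, then a tab and the path, while git ls-tree -r prints mode, object type, and hash ID, then a tab and the path. The parser names and the sample lines are made up for illustration (the hash shown is the ID Git gives an empty file), and unusual paths that Git quotes in its output are not handled.

```python
def parse_ls_files_stage(line):
    """Parse one line of `git ls-files --stage` output."""
    meta, path = line.split("\t", 1)
    mode, object_id, stage = meta.split(" ")
    return {"mode": mode, "object_id": object_id,
            "stage": int(stage), "path": path}


def parse_ls_tree(line):
    """Parse one line of `git ls-tree -r <commit>` output."""
    meta, path = line.split("\t", 1)
    mode, object_type, object_id = meta.split(" ")
    return {"mode": mode, "type": object_type,
            "object_id": object_id, "path": path}


# Illustrative sample lines, not captured from a real repository.
index_line = "100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\tREADME"
tree_line = "100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\tREADME"

index_entry = parse_ls_files_stage(index_line)
tree_entry = parse_ls_tree(tree_line)

# Same hash ID in both places: the index copy matches the committed copy,
# so git status would print nothing in the first column for this file.
print(index_entry["object_id"] == tree_entry["object_id"])
```

Comparing the hash IDs entry by entry like this is, in spirit, the HEAD-vs-index comparison described earlier.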

    The key here is that git status summarizes the state of the three entities of interest, by comparing HEAD to the index, and then comparing the index to the work-tree. The two columns show you the differences between them, stripped down to just a letter code and a file name. Even though the next commit will be a snapshot of every file that is in the index / staging-area at that time, it's more useful to tell you what's different about that snapshot, as compared to the current snapshot, or the potential snapshot you could make by copying work-tree files into the index.


    [1] On Windows and macOS, where opening a file named readme.txt opens an existing file named README.TXT (and vice versa), you can use lowercase, but Git has various places where it hard-codes the all-capitals HEAD string, so it's best to stick with that. If you don't like typing that much, the character @ is a synonym for HEAD.

    [2] Technically, a commit stores the hash ID of a tree object. The tree object stores each file's name, mode (100644 or 100755), and content-hash-ID, along with names and hash IDs for subtrees as needed. Hence the file contents are not actually inside the commit, but rather laid out as blob objects, right alongside commit and tree objects. This is the mechanism by which commits (and the index!) share blob objects so that however many snapshots you have of a big file, you really only have one copy in the repository database.