Git renamed files and inodes


Suppose we run the following commands on a file (hello.txt) tracked by git, in a clean working copy:

echo "hi" >> hello.txt
mv hello.txt bye.txt
git rm hello.txt
git add bye.txt
git status

Result:

On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    renamed:    hello.txt -> bye.txt

So, git knows it is the same file, even though it was renamed. I had some vague memory that git checks inodes to determine that the new file is the same as the old, deleted file. This and this SO answer, though, suggest that git only checks the contents of the file, and does not in any way check that it is the same inode. (My conclusion (*): if I made bigger modifications to the file, git would not detect the rename, even though the inode was still the same.)

It thus seemed quite obvious to me that I was wrong, and that git does not check inodes (or any other file-system info at all), just the contents. But then I found this other answer, which claims that

In addition to the timestamp, it [i.e., git] records the size, inode, and other information from lstat to reduce the chance of a false positive. When you perform git-status, it simply calls lstat on every file in the working tree and compares the metadata in order to quickly determine which files are unchanged.

I actually have two questions about this:

  1. Is my understanding below correct?

Git does rely (also) on inodes to detect if a file was changed, but it does NOT use inodes to detect file renames.

  2. Assuming 1. is correct: why does git not rely on inodes to detect file renames? If it did, we would not have the problem marked with (*) above. (I.e., it would detect the rename no matter how big the content change.)

(I imagine that the answer is something like "so that the behaviour is the same on systems that don't have inodes, e.g. Windows". However, if that's the case, then this "same behaviour" was already broken by relying on inodes for detecting changes.)


Solution

  • The full answer is complicated, but there's no cause for concern here. There is one real problem, which I'll get to at the end, but it has nothing to do with inodes.

    Let's start with a side trip to discuss—as briefly as I can and still remain stand-alone—Git's HEAD, index, and work-tree. Let's look briefly at the file/object storage model as well. Then, let's talk about git diff, and then about git status. Then we'll be ready to look at how the index works as a cache, and where inodes come in. Last, we'll be ready to see how the real problem occurs.

    Up here, though, I'll insert this summary: Normally, this is all completely invisible. The cached data are correct and the second git diff that git status runs goes fast. Or, the cached data are out of date, Git notices that the cached data are out of date, and the second git diff goes slower and—as a side effect—updates whatever cached data it can, so that another git diff run by another git status will go fast. So, normally, you don't have to care about any of this.


    HEAD, the index, and the work-tree

    The work-tree is, of course, simply a tree of files in their ordinary (non-Git) format, where you and all the code on your computer can work with them. Initially, you clone a repository and/or run git checkout branch and your work-tree is now filled with the files that correspond to some branch tip, such as master or branch. You can also run git checkout hash or similar to get what Git calls a "detached HEAD"; in this case the current commit is some historical commit, but as before, your work-tree is filled with the files that correspond to that commit. (There are some exceptions to this rule: for instance, you can have untracked files; and see Checkout another branch when there are uncommitted changes on the current branch.)

    The HEAD commit is, by definition, the current commit. As with every other commit, this commit is read-only; it has some metadata (author and committer, parent commit hash, and commit message); and it stores a tree object hash ID, by which it stores (indirectly) a complete snapshot of files. Since this is the current commit, it's also—initially at least, and there are various special cases that can interfere here—what you will see in your work-tree. Note that all the files in the current commit are not just read-only, like everything inside the object database; they are also in a special Git-only format. Few if any non-Git commands can read these files at all.

    Between the HEAD and work-tree, though, there's a point where Git deviates rather radically from other version control systems like Mercurial and Subversion. Git exposes—and in fact forces you to know about—Git's index, also called the staging-area and the cache. This index really does, at least figuratively, stand right between the HEAD and the work-tree. The HEAD (current commit) contains a snapshot of files in a special Git-only form. The work-tree contains all your files in ordinary form. If we put HEAD on the left and the work-tree on the right, the index occupies the space in between. If you're in a new repository with just a README file committed, you might have this rather silly looking situation:

     HEAD     index     w.tree
    ------    ------    ------
    README    README    README
    

    The README in HEAD is read-only. It's in special Git form. You can't change it.

    The README in the index is also in special Git form, but it's read/write: you can change it. You can't actually use it at all though, because it's in that special Git-only form.

    The README in your work-tree is in ordinary (non-Git) form. It's read/write: you can do whatever you want with it. Git can't use it yet though, because it's not in the special Git-only form.

    The full purpose of the index is complicated, but the short version of it—before we get into inodes at all—is that it's where you build the next commit you will make. If you want to change the README, or add a new file, you can first make the change in your work-tree. Let's say you change README and create a new (as yet untracked) a.txt:

     HEAD     index     w.tree
    ------    ------    ------
    README-   README-   README+
                        a.txt
    

    For the purpose of this diagram I've labeled the two variants of README with - (the old one) and + (the new one). The new, modified README is only in your work-tree.

    If you now were to run git add README, this would copy the work-tree README into the special Git-only format, and put that into the index. If, instead, you run git add a.txt, that will copy the work-tree a.txt into the special Git-only format and put that into the index. The end result is:

     HEAD     index     w.tree
    ------    ------    ------
    README-   README-   README+
              a.txt     a.txt
    

    If you now run git commit—without first running git add README—Git will now make a new commit from whatever is in the index right now. That's the old README and the new a.txt. This new commit becomes the current (HEAD) commit, so now we have:

     HEAD     index     w.tree
    ------    ------    ------
    README-   README-   README+
    a.txt     a.txt     a.txt
    

    If you now run git add README, the index will get the new version of README; committing that will make a new HEAD commit with the new README so that everything matches:

     HEAD     index     w.tree
    ------    ------    ------
    README    README    README
    a.txt     a.txt     a.txt
    

    In each case, git commit just takes whatever is in the index right then and turns it into a frozen, read-only snapshot for the new commit. Since the files are already in the special Git-only format, this goes very fast. That's one of the tricks Git uses to get its speed: the slow part, converting from plain format to special compressed Git format, happens during git add, not during git commit. If you have millions of files, but only modified two or three, Git never has to re-compress all the millions of files.
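
    The sequence of diagrams above can be replayed with a toy model. This is only a sketch: plain dictionaries stand in for HEAD, the index, and the work-tree, and the helper names git_add and git_commit are hypothetical, not Git's internals.

```python
# Toy model only: dicts of path -> content stand in for HEAD, the index,
# and the work-tree. Real Git stores hash IDs, not content, in the index.

def git_add(index, work_tree, path):
    """Copy one work-tree file into the index (hypothetical helper)."""
    updated = dict(index)
    updated[path] = work_tree[path]
    return updated


def git_commit(index):
    """Freeze whatever is in the index right now into a new snapshot."""
    return dict(index)


head = {"README": "old"}
index = dict(head)
work_tree = {"README": "new", "a.txt": "A"}

# git add a.txt, then git commit: the commit gets the OLD README.
index = git_add(index, work_tree, "a.txt")
first_commit = git_commit(index)
print(first_commit)  # {'README': 'old', 'a.txt': 'A'}

# git add README, then git commit: now all three match.
index = git_add(index, work_tree, "README")
head = git_commit(index)
print(head == index == work_tree)  # True
```

    The point of the model is that git_commit never looks at the work-tree at all; only git_add does.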

    File and object storage

    Let's look at the way that Git stores commits and files, which Git calls blobs, and its other two intermediate object types, which Git calls trees and annotated tags. There are multiple levels of compression that Git can use on these data, but we won't go into any of that; we'll just look at how Git uses hash IDs.

    What Git does with all four of these things (which Git calls objects) is to reduce them all to a cryptographic checksum (currently SHA-1 but moving to a new checksum eventually). Git prepends a header giving the object type (commit, tree, blob, or tag) and the size in bytes, and calculates the hash over the header plus the data. The result is, for all practical purposes, unique (see also How does the newly found sha1 collision affect git?). Git uses this as the key in a key-value store to stuff the (compressed) data into the repository database. Git can thus extract the object data quickly given the key.
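
    As a minimal sketch of that hashing step, assuming SHA-1 and the usual header layout of type, a space, the decimal size, and a NUL byte, the ID of a blob can be computed outside Git. The function name git_object_id is made up for this example.

```python
import hashlib


def git_object_id(obj_type: str, data: bytes) -> str:
    """Hash '<type> <size>\\0' followed by the raw data, as SHA-1 hex."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# The empty blob and a one-line file.
print(git_object_id("blob", b""))         # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_object_id("blob", b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

    If Git is installed, the second value can be compared against the output of git hash-object on a file containing just that line.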

    What this means for us is that within a commit (as identified by its unique hash ID), each file is really stored as just a <name, ID> pair. (More correctly, it's a <mode, name, ID> triple. This is also true within the index, although there, Git stores even more data.) This makes it really easy to tell if a file is completely unchanged: if it is, it has the same hash ID, because the same input data always reduce to the same hash ID.

    Since the actual contents are in the key-value store under the ID, the commit can just list the ID. If thousands of commits list README or a.txt with the same ID, the actual file is stored only once, under the ID; each commit stores just the ID.

    Of course, if one commit has one version of README with one ID, and another commit has a different version of README, the two commits will have two different IDs for the file named README.

    git diff and rename detection

    There are a lot of nitty details about git diff—some of which will hit us in just a moment—but let's ignore them for now and concentrate instead on how git diff works when you give it two particular commits. Git can look up both commits, obtain their stored snapshot trees, and compare IDs. Any IDs that match mean the files match, so git diff only has to look at files that have different IDs. This is an enormous time-saver.

    Suppose we ask Git to compare commit/tree L (left) vs commit/tree R (right), and every file except for README has the same ID. That is, L's a.txt has ID 12345... and its b.dat has ID 6789a..., but L's README is ccccc.... R's a.txt also is 12345... and its b.dat is also 6789a..., but R's README is eeeee.... Git is only really going to have to extract the two README blobs (files ccccc... and eeeee...) and compare those two blobs to produce context diffs.
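
    A sketch of that shortcut, treating each tree as a flat mapping from path to hash ID (an assumption made for illustration; real trees are nested, and the IDs below are the abbreviated placeholders from the paragraph above):

```python
def changed_paths(left, right):
    """Return paths present in both trees whose hash IDs differ."""
    common = left.keys() & right.keys()
    return sorted(path for path in common if left[path] != right[path])


left_tree = {"a.txt": "12345", "b.dat": "6789a", "README": "ccccc"}
right_tree = {"a.txt": "12345", "b.dat": "6789a", "README": "eeeee"}

# Only README needs its two blobs extracted and compared line by line.
print(changed_paths(left_tree, right_tree))  # ['README']
```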

    Now suppose that we have Git compare two trees, and everything is the same between L and R except that L has a file named README and R has a file named README.md. Was the file renamed? It could have been! Git can, first, compare the two hashes. If they match exactly, the file was certainly renamed. If they don't match exactly, Git can extract the two blobs and compare them for similarity. If they are similar enough (say, 97% similar; the default threshold is 50%), Git can assume the file was renamed.

    That, in a nutshell, is how git diff does rename detection: take the tree on the left L and the tree on the right R. All the files that exist in both L and R are either "the same" or "modified". Files that were in L but aren't in R, can maybe be matched up with files that are only in R. First do a fast check of their hashes and pair up exact matches. Then, do a similarity scan on everything that's left, and pair up those that are sufficiently similar: they were renamed (and maybe modified slightly too). Any remaining files that are gone from L or new in R were deleted or newly-added.
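
    The two passes can be sketched as follows. This is not Git's algorithm: difflib's ratio stands in for Git's own similarity score, trees are flat mappings from path to file bytes, and detect_renames is a hypothetical name. The 0.5 threshold mirrors Git's default of 50%.

```python
import difflib
import hashlib


def blob_id(data: bytes) -> str:
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def similarity(a: bytes, b: bytes) -> float:
    # Stand-in metric; Git computes similarity differently.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def detect_renames(left, right, threshold=0.5):
    """Return (renames, deleted, added) for two path -> bytes mappings."""
    deleted = sorted(p for p in left if p not in right)
    added = sorted(p for p in right if p not in left)
    renames = {}

    # Pass 1: pair up files whose hash IDs match exactly.
    for old in list(deleted):
        for new in added:
            if blob_id(left[old]) == blob_id(right[new]):
                renames[old] = new
                deleted.remove(old)
                added.remove(new)
                break

    # Pass 2: pair up what is left if it is similar enough.
    for old in list(deleted):
        scored = [(similarity(left[old], right[new]), new) for new in added]
        if not scored:
            continue
        score, new = max(scored)
        if score >= threshold:
            renames[old] = new
            deleted.remove(old)
            added.remove(new)

    return renames, deleted, added


left = {"hello.txt": b"hello\n", "old.txt": b"line1\nline2\nline3\n",
        "gone.txt": b"xxxxxxxx"}
right = {"bye.txt": b"hello\n", "new.txt": b"line1\nline2\nline3\nline4\n",
         "fresh.txt": b"0123456789abcdef"}
renames, deleted, added = detect_renames(left, right)
print(renames, deleted, added)
```

    Note that nothing in either pass looks at inodes: a file that keeps its inode but changes most of its content falls below the threshold and shows up as one deletion plus one addition.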

    Making git diff fast is a problem with the work-tree

    The scheme outlined above works great with actual commits, because files inside commits are in that special, Git-only form. It even works with the index, because files in the index are also in special, Git-only form: they've already been reduced to hash IDs. The index, in this case, acts like a flattened tree. The work-tree, alas, is not in the special, Git-only form. We'll come back to this soon, because....

    The git status command just runs two git diffs

    When you run git status, Git runs two internal diffs. The first one compares HEAD vs the index. This is very fast for the reason we saw above: everything is already in this ideal format, with files reduced to unique hash IDs. Git can scan HEAD as L and the index as R, and compute the diff very quickly. (Since we don't care about the changes themselves—just about which files are the same, which are renamed, and which are modified—Git can omit the slowest part of most such diffs, which is computing the context diff to print.)

    Alas, the second diff is much slower: Git must compare the index vs the work-tree. The work-tree is not in special Git-only format. Git could make a second, temporary index and add everything to it, but this would be very slow, so it doesn't do that. To make this diff much faster, Git secretly adds cache data to the index, and this is where the inodes come in. The inode numbers are part of this cached data. But this is (normally, at least; see below) just a speed hack. If the inode numbers change, git status is simply slower.

    The index as cache

    In those earlier diagrams showing HEAD, the index, and the work-tree, notice how common it was to have all three copies of a file exactly the same, or (once we modify a file in the work-tree and then git add it) to have the index match the work-tree. What if there were some way that Git could know, quickly, whether a work-tree file had been changed since an earlier time when Git looked very closely at the work-tree file and knew for sure that it was, or wasn't, exactly the same as the index version?

    It turns out that while there is no perfect method for this, there's a method that's good enough (at least in most people's valuations). Git can use the OS's lstat system call on each work-tree file and save, in the index, some of the data from the call (part but not all of ctime, mtime, ino, mode, uid, gid, and size, per the index format documentation in the technical notes). If the data in a later lstat call match that from an earlier one, the work-tree file is assumed to have the same in-file data as before.
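
    A rough sketch of this kind of stat cache, using Python's os.lstat. The field list follows the one quoted above; exactly which parts Git keeps, and how it truncates them, is not modelled here, and the names stat_signature and maybe_changed are invented for the example.

```python
import os
import tempfile

# Fields taken from the list above; Git keeps only part of each.
FIELDS = ("st_ctime_ns", "st_mtime_ns", "st_ino", "st_mode",
          "st_uid", "st_gid", "st_size")


def stat_signature(path):
    """Snapshot the lstat fields that act as the cache key."""
    st = os.lstat(path)
    return tuple(getattr(st, field) for field in FIELDS)


def maybe_changed(path, cached):
    """True if the file must be re-examined; False means 'assume unchanged'."""
    return stat_signature(path) != cached


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello.txt")
    with open(path, "w") as f:
        f.write("hello\n")
    cached = stat_signature(path)

    untouched = maybe_changed(path, cached)   # nothing happened to the file
    with open(path, "a") as f:
        f.write("hi\n")                       # the size changes
    modified = maybe_changed(path, cached)

print(untouched, modified)  # False True
```

    A mismatch only means the contents have to be checked the slow way; it says nothing about renames.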

    The exact usefulness of these data is a bit tricky. Some of the stored data are used to decide whether a work-tree file is "clean", i.e., matches the version in the index. There is a one-second granularity issue and a race condition where Git may have to assume, temporarily, that a work-tree file is not clean, and then do an expensive clean operation on the file to find out whether it's really clean or not. Note, however, that the general case is that Git simply does extra work, i.e., slows down to check whether a file that is clean should be considered clean. It does not cause Git to consider a file clean when it's actually dirty. The one case that could fool the detector here occurs when you manage to set both the mtime and ctime back while keeping the (low 32 bits of the) size the same, but doing so generally requires re-setting the computer's clock as well.[1]


    [1] This is because the system calls that alter the mtime to any value you choose, all set the ctime to "now" where "now" is taken from the system clock. Hence, to set the mtime to (e.g.) yesterday while also setting the ctime to yesterday, you must first set the system itself to yesterday.


    The one real problem

    There's a more significant problem, though, that really does show up in real repositories. Suppose the index's cache attributes tell you that a work-tree file is clean, i.e., the work-tree version matches the index version of the file. Suppose also that you are making use of .gitattributes with clean and smudge filters, or with end of line conversions. In this case, copying a file from the index to the work-tree applies the smudge filter:

    read-from-index :0:$path | $smudge > $path
    

    (where read-from-index is a somewhat hypothetical program that is actually implemented by git cat-file -p, $smudge is your filter for this file, and $path is the path name you want for the file—the :0: is the special syntax Git uses for "index slot zero").

    Meanwhile, copying files from the work-tree to the index applies the clean filter:

    $clean < $path | write-to-index $path
    

    (where write-to-index can be written using git update-index; you also need to supply the mode and stage number).

    The problem comes in two parts:

    • the filters chosen for $clean and $smudge depend on end of line conversion selection, .gitattributes contents, and your configuration; and
    • the actions taken by $clean and $smudge are not under Git's control.

    If Git determines that a file is "clean" based on its stat and index data, but you change which $clean filter is applied, or what $clean does, then re-cleaning the file and writing the result to the index would produce different index data. In other words, even though the index's cache attributes proclaim that the file is clean, it's actually dirty.

    Where this typically shows up is when you add line-ending changes to your configuration and/or edit .gitattributes to change which files get line-ending changes applied. Note that if you never have Git touch line endings, this is never a problem.
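
    To make the failure concrete, here is a small sketch in which the work-tree bytes never change (so any stat-based check still says "clean") while the clean filter does. The two filter functions are illustrative stand-ins for whatever $clean is configured to be.

```python
import hashlib


def blob_id(data: bytes) -> str:
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def clean_identity(data: bytes) -> bytes:
    """Old configuration: store the bytes as they are."""
    return data


def clean_crlf_to_lf(data: bytes) -> bytes:
    """New configuration: normalize CRLF line endings to LF."""
    return data.replace(b"\r\n", b"\n")


work_tree_bytes = b"line one\r\nline two\r\n"

# ID recorded in the index when the file was added under the old settings.
staged_id = blob_id(clean_identity(work_tree_bytes))

# Same untouched file, re-cleaned under the new settings.
recleaned_id = blob_id(clean_crlf_to_lf(work_tree_bytes))

print(staged_id == recleaned_id)  # False
```

    The file's lstat data are identical before and after the configuration change, which is exactly why the cached "clean" verdict is wrong.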

    There are two remedies, one that works en masse by removing and recreating the index, and a simpler one:

    • If you know you have not staged any files, you can remove the index file (.git/index) and run git reset (which does a --mixed reset, re-creating the index from HEAD). If you have staged files and hit this problem, you can still use this remedy; you just need to re-stage afterward. If you've carefully staged parts of some files, you won't want to use this method, but you can use the simpler one-file-at-a-time remedy.

    • If you just want to force Git to consider some file $path as dirty, update its modification time to "now", e.g.:

      $ touch "$path"
      

      Now the file is marked dirty and Git will be forced to run whatever the currently-defined cleaning process is, before seeing whether the file is clean.
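
    Why touch is enough can be seen in a short sketch: it changes the modification time, so the cached stat data no longer match. The touch function below is a hypothetical Python equivalent, and the file is first back-dated by a day only so that the difference is guaranteed to be visible.

```python
import os
import tempfile
import time


def touch(path):
    """Set the access and modification times to 'now', like touch(1)."""
    os.utime(path, None)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")
    with open(path, "w") as f:
        f.write("data\n")

    # Pretend the file was last modified a day ago.
    yesterday = time.time() - 86400
    os.utime(path, (yesterday, yesterday))
    before = os.lstat(path).st_mtime_ns

    touch(path)
    after = os.lstat(path).st_mtime_ns

print(after > before)  # True
```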