git version-control reverse-engineering git-add

How does `git add` deal with changes like file<->directory?

This is a long question. I'm trying to reverse-engineer some basic Git functionality, and I'm having trouble wrapping my head around what git add really does under the hood. I'm already familiar with the three trees of Git, and I know that the index file is not really a tree but rather a sorted-array representation of one.

My original hypothesis is as follows: when git add <pathspec> is run,

If <pathspec> exists in working directory:
1. Create an index file from <pathspec> that reflects the state of <pathspec> in the working directory
2. Overwrite the relevant section of index file with this (sub-)index.
If <pathspec> exists only in current index file:
1. This means <pathspec> has been deleted in the working directory, so...
2. Delete the relevant section of the index file that corresponds to <pathspec>.
If <pathspec> does not exist in working directory or index file:
1. fatal: pathspec <...> did not match any files

This hypothesis reflects a "do what you're told to do" git add, that only looks at the path and registers changes at or under this path to the index file. For most cases, this is how the actual git add seems to work.

But there are some cases that don't seem very straightforward:

1. Replacing a file with a directory

git init

touch somefile
git add . && git commit

rm somefile
mkdir somefile && touch somefile/file

At this point, the index file consists of only a single entry for the somefile file I just deleted, as expected. Now I execute git add. I have two ways of doing this: git add somefile or git add somefile/file. (Obviously I'm excluding the trivial git add . here)

What I expected:

git add somefile: equivalent to git add . - remove old entry and add new entry
git add somefile/file: only add an index entry for the new somefile/file.

What actually happens: Either of the above commands leads directly to the final state of a single index entry for somefile/file; that is, both are equivalent to running git add . on the whole tree.

Here, it feels like git add is not a straightforward "do what you're told to do" command. git add somefile/file seems to peek in and around the provided path, realize that the file somefile is no longer there, and automatically remove its index entry.
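To check case 1 entry by entry, the steps can be replayed in a scratch repository, printing the index with git ls-files --stage before and after the add. This is only a minimal sketch: the temporary directory, the commit message, and the demo identity passed with -c are assumptions added so the script runs unattended, and the result may differ between Git versions.

```shell
#!/bin/sh
set -eu

# Scratch repository; the location is arbitrary.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Commit somefile as a regular (empty) file.
touch somefile
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "somefile as a file"

# Replace the file with a directory of the same name.
rm somefile
mkdir somefile && touch somefile/file

echo "index before add:"
git ls-files --stage

# Stage only the nested path, not the directory itself.
git add somefile/file

echo "index after add:"
git ls-files --stage
```

Swapping the git add somefile/file line for git add somefile lets you compare the two variants side by side.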

2. Replacing a directory with a file

git init

mkdir somefile && touch somefile/file
git add . && git commit

rm -r somefile && touch somefile

At this point, the index file contains a single entry for the old somefile/file as expected. Again, I execute git add in the same two variants.

What I expected:

git add somefile/file: Normally, remove entry for the old somefile/file. But if it peeks around, it should also add new entry for somefile.
git add somefile: equivalent to git add ..

What actually happens:

git add somefile/file: leads to an empty index file - so, it does what I normally expect it to do!
git add somefile: equivalent to git add ., as expected

Here, git add behaves as a "do what you're told to do" command. It only picks up the paths and overwrites the appropriate section of index file with what the working directory reflects. git add somefile/file does not poke around and thus does not automatically add an index entry for somefile.
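Both variants of case 2 can be compared in one run by restoring the index from HEAD between them. A rough sketch, assuming a scratch directory, a demo identity, and git reset -q as the way to put the index back (none of which are part of the steps above):

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

# Commit somefile as a directory containing one file.
mkdir somefile && touch somefile/file
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "somefile as a directory"

# Replace the directory with a regular file of the same name.
rm -r somefile && touch somefile

# Variant 1: name the path that no longer exists in the working directory.
git add somefile/file
echo "after git add somefile/file:"
git ls-files --stage

# Put the index back to the committed state; the working directory is untouched.
git reset -q

# Variant 2: name the new file.
git add somefile
echo "after git add somefile:"
git ls-files --stage
```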

3. Inconsistent index file

Up to this point, a possible theory could be that git add tries to avoid the case of an inconsistent index file - ie, an index file that does not represent a valid work tree. But one extra level of nesting leads to exactly that.

git init

touch file1
git add . && git commit

rm file1 && mkdir file1 && mkdir file1/subdir
touch file1/subdir/something
git add file1/subdir/something

This is similar to case 1, except that the directory here has an extra level of nesting. Just before the final git add, the index file consists only of an entry for the old file1, as expected. That final git add can be run in three variants: git add file1, git add file1/subdir, and git add file1/subdir/something.

What I expected:

git add file1: Equivalent to git add ., leads to single index entry for file1/subdir/something.
git add file1/subdir and git add file1/subdir/something: Normally, should only add an entry for file1/subdir/something (leading to inconsistent index file). But if the above "no-inconsistent-index" theory is correct, these should also remove the old file1 index entry, thus being equivalent to git add ..

What actually happens:

git add file1: Works as expected, equivalent to git add ..
git add file1/subdir and git add file1/subdir/something: Only add a single entry for file1/subdir/something, leading to an inconsistent index file that cannot be committed.

The inconsistent index file I'm referring to is:

100644 <object addr> 0  file1
100644 <object addr> 0  file1/subdir/something

So just adding another level of nesting seems to stop git add from peeking around as it did in case 1! Note that the path provided to git add didn't matter either: both file1/subdir and file1/subdir/something lead to an inconsistent index file.

The above cases paint a very complicated implementation of git add. Am I missing something here, or is git add really not as simple as it seems?

Solution

Actually, this just means you have found a bug in (at least some versions of) Git.

Git understands that OSes cannot support two entities, one being a file and another being a directory/folder, with the same name. That is, we cannot have both file1 being a file and file1 being a directory.¹

Now, the thing about Git's index is that it has no ability to hold directories in it at all.² The only allowed entities are files. So either file1 exists, or else file1/subdir/something exists, but never both. Git has a bunch of rather complicated code inside it, both for the index itself and for handling OS-level files during git checkout, git reset, and the like, that is supposed to take care of "D/F" (directory/file) conflicts. Git needs to handle these when it checks out a commit where somefile is a file and then checks out a different commit where somefile/file is a file, so that somefile must be removed and a directory created in its place. It needs to handle the checkout that goes back to the first situation, where somefile/file must be removed, then somefile/ must be rmdir-ed, and then somefile can be created as a file. And it must handle merges where somefile is a file in one or two of the three commits but somefile/file exists in the remaining commits.
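The checkout side of this is easy to watch from the outside. Here is a small sketch, assuming two commits tagged as-file and as-dir (hypothetical tag names), a scratch directory, and a demo identity; it switches between the two commits and reports what somefile is on disk each time.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

# Helper so the commits work without a configured identity.
commit() {
    git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
}

# Report whether somefile is currently a file or a directory on disk.
report() {
    if [ -d somefile ]; then
        echo "somefile is a directory"
    elif [ -f somefile ]; then
        echo "somefile is a regular file"
    fi
}

# First commit: somefile is a regular file.
touch somefile
git add somefile
commit "somefile as a file"
git tag as-file

# Second commit: somefile is a directory containing file.
git rm -q somefile
mkdir somefile && touch somefile/file
git add somefile/file
commit "somefile as a directory"
git tag as-dir

# Directory -> file: somefile/file and somefile/ must go before the file appears.
git checkout -q as-file
report

# File -> directory: the file must go before somefile/file can be created.
git checkout -q as-dir
report
```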

Apparently, someone missed a corner case. I was able to reproduce this myself, using your steps, and:

$ git ls-files --stage
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       file1
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       file1/subdir/something
$ git write-tree
You have both file1 and file1/subdir/something
fatal: git-write-tree: error building trees

This state is not supposed to exist. It's the addition of file1-as-a-directory that erases the index slot containing file1:

$ git add file1
$ git ls-files --stage
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       file1/subdir/something

as this triggers the code that strips out the now-undesirable entry.

(It's pretty clear that this needs a fix and a test-suite test case. Fortunately Git self-detects the bad case during the tree-build process, so that it does not make bad commits.)
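A check along these lines could tell you whether a given Git version still has the problem. This is only a sketch, not a test in the style of Git's own test suite: the scratch directory, the demo identity, and the result variable are assumptions, and it relies on git write-tree failing for the inconsistent index, as in the transcript above.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

# Commit file1 as a regular file.
touch file1
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "file1 as a file"

# Replace it with a directory nested two levels deep.
rm file1
mkdir -p file1/subdir
touch file1/subdir/something
git add file1/subdir/something

git ls-files --stage

# write-tree refuses to build a tree from an index holding both file1 and
# file1/subdir/something, so its exit status shows which behavior we got.
if git write-tree >/dev/null 2>&1; then
    result="consistent"
else
    result="inconsistent"
fi
echo "index is $result"
```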

¹I think perhaps we should be able to do this, but it's currently forbidden by POSIX rules and none of the Unix-like file systems support it. It would make a mess of archivers like tar, too.

²This is not quite strictly true: for various speedup purposes, the index holds "irregular" (non-cache) entries as well as the normal cache entries that describe the proposed next commit. It's the cache entries that do not hold directory existence; the entries that aren't stuff-to-be-committed can hold all kinds of auxiliary information. But none of these are shown by git ls-files.