This is a long question. I'm trying to reverse-engineer some basic Git functionalities, and am having some trouble wrapping my head around what git add really does under the hood. I'm already familiar with the three trees of Git, and that the index file is not really a tree but rather a sorted-array representation of the tree.
My original hypothesis is as follows: when git add <pathspec> is run,
<pathspec> exists in working directory:
<pathspec> exists only in current index file:
<pathspec> does not exist in working directory or index file: fatal: pathspec <...> did not match any files
This hypothesis reflects a "do what you're told to do" git add, one that only looks at the path and registers changes at or under this path to the index file. For most cases, this is how the actual git add seems to work.
But there are some cases that don't seem very straightforward:
git init
touch somefile
git add . && git commit
rm somefile
mkdir somefile && touch somefile/file
At this point, the index file consists of only a single entry for the somefile file I just deleted, as expected. Now I execute git add. I have two ways of doing this: git add somefile or git add somefile/file. (Obviously I'm excluding the trivial git add . here.)
What I expected:
git add somefile: equivalent to git add . - remove old entry and add new entry.
git add somefile/file: only add an index entry for the new somefile/file.
What actually happens: Either of the above commands leads directly to the final state of having a single index entry for somefile/file - i.e., both are equivalent to git add .
Here, it feels like git add is not your straightforward "do what you're told to do" command. git add somefile/file seems to peek in and around the provided path, realize that somefile is no longer there as a file, and automatically remove its index entry.
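For anyone who wants to check this on their own Git version, here is a minimal sketch that replays case 1 in a throwaway repository and prints the index before and after the add. The function name case1_demo and the commit identity are placeholders I made up for illustration, and what the "after" listing shows may depend on the Git version in use.

```shell
# Hypothetical helper: replay case 1 in a temporary repository.
case1_demo() {
  repo=$(mktemp -d) || return 1
  (
    cd "$repo" || exit 1
    git init -q .
    touch somefile
    git add .
    # Placeholder identity so the commit works without global config.
    git -c user.name=demo -c user.email=demo@example.invalid \
        -c commit.gpgsign=false commit -q -m "add somefile"
    # Replace the file with a directory of the same name.
    rm somefile
    mkdir somefile && touch somefile/file
    echo "index before:"
    git ls-files --stage
    git add somefile/file
    echo "index after:"
    git ls-files --stage
  )
  rc=$?
  rm -rf "$repo"
  return $rc
}
```

Calling case1_demo prints both listings; swapping git add somefile/file for git add somefile inside the function exercises the other variant.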
git init
mkdir somefile && touch somefile/file
git add . && git commit
rm -r somefile && touch somefile
At this point, the index file contains a single entry for the old somefile/file as expected. Again, I execute git add in the same two variants.
What I expected:
git add somefile/file: Normally, remove entry for the old somefile/file. But if it peeks around, it should also add a new entry for somefile.
git add somefile: equivalent to git add .
What actually happens:
git add somefile/file: leads to an empty index file - so, it does what I normally expect it to do!
git add somefile: equivalent to git add . , as expected.
Here, git add behaves as a "do what you're told to do" command. It only picks up the paths and overwrites the appropriate section of the index file with what the working directory reflects. git add somefile/file does not poke around and thus does not automatically add an index entry for somefile.
Up to this point, a possible theory could be that git add tries to avoid the case of an inconsistent index file - i.e., an index file that does not represent a valid work tree. But one extra level of nesting leads to exactly that.
git init
touch file1
git add . && git commit
rm file1 && mkdir file1 && mkdir file1/subdir
touch file1/subdir/something
git add file1/subdir/something
This is similar to case 1, only that the directory here has an extra level of nesting. At this point, the index file consists only of an entry for the old file1 as expected. Again, now we run git add but with three variants: git add file1, git add file1/subdir and git add file1/subdir/something.
What I expected:
git add file1: Equivalent to git add . , leads to a single index entry for file1/subdir/something.
git add file1/subdir and git add file1/subdir/something: Normally, should only add an entry for file1/subdir/something (leading to an inconsistent index file). But if the above "no-inconsistent-index" theory is correct, these should also remove the old file1 index entry, thus being equivalent to git add .
What actually happens:
git add file1: Works as expected, equivalent to git add .
git add file1/subdir and git add file1/subdir/something: Only add a single entry for file1/subdir/something, leading to an inconsistent index file that cannot be committed.
The inconsistent index file I'm referring to is:
100644 <object addr> 0 file1
100644 <object addr> 0 file1/subdir/something
So just adding another level of nesting seems to stop git add from peeking around as it did in case 1! Note that the path provided to git add didn't matter either - both file1/subdir and file1/subdir/something lead to an inconsistent index file.
The above cases paint a very complicated implementation of git add. Am I missing something here, or is git add really not as simple as it seems?
Actually, this just means you have found a bug in (at least some versions of) Git.
Git understands that OSes cannot support two entities, one being a file and another being a directory/folder, with the same name. That is, we cannot have both file1 being a file and file1 being a directory.[1]
Now, the thing about Git's index is that it has no ability to hold directories in it at all.[2] The only allowed entities are files. So either file1 exists, or else file1/subdir/something exists, but never both. Git has a bunch of rather complicated code inside it, for both the index itself and for handling of OS-level files during git checkout, git reset, and the like, that is supposed to take care of "D/F" (directory/file) conflicts. Git needs to be able to handle these when doing a git checkout of a commit where somefile is a file, then a git checkout of a different commit where somefile/file is a file, so that somefile must be removed and a directory must be created in its place. It needs to be able to handle the checkout where we go back to the first situation, so that somefile/file must be removed, then somefile/ must be rmdir-ed, then somefile can be created as a file. And it must handle merges where somefile was a file in one or two of the three commits but somefile/file exists in the remaining two or one.
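To make the "never both" rule concrete, here is a rough sketch that scans a list of paths for D/F conflicts, i.e. any path that is also a leading directory of another path. This is only an illustration of the invariant, not how Git implements the check internally; the function name find_df_conflicts is made up for this example, and it assumes one path per line with no embedded newlines.

```shell
# Hypothetical helper: read one path per line on stdin and report every
# path that is also used as a leading directory of another path.
find_df_conflicts() {
  awk '
    { paths[$0] = 1 }                      # remember every listed path
    END {
      for (p in paths) {
        n = split(p, parts, "/")
        prefix = ""
        # Walk each proper leading directory of p: a, a/b, a/b/c, ...
        for (i = 1; i < n; i++) {
          prefix = (i == 1) ? parts[1] : prefix "/" parts[i]
          if (prefix in paths)
            print prefix " conflicts with " p
        }
      }
    }
  '
}
```

Inside a repository it could be fed from the index with git ls-files | find_df_conflicts; an empty result means no file entry doubles as a directory.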
Apparently, someone missed a corner case. I was able to reproduce this myself, using your steps, and:
$ git ls-files --stage
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0 file1
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0 file1/subdir/something
$ git write-tree
You have both file1 and file1/subdir/something
fatal: git-write-tree: error building trees
This state is not supposed to exist. It's the addition of file1-as-a-directory that erases the index slot containing file1:
$ git add file1
$ git ls-files --stage
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0 file1/subdir/something
as this triggers the code that strips out the now-undesirable entry.
(It's pretty clear that this needs a fix and a test-suite test case. Fortunately Git self-detects the bad case during the tree-build process, so that it does not make bad commits.)
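Since git write-tree refuses to build trees from an index in this state, it can double as a quick consistency probe. A minimal sketch, with the function name index_is_committable being my own invention; note that on success git write-tree does write tree objects into the object database, and that the function also reports failure when run outside a repository.

```shell
# Hypothetical helper: report whether the current index can be turned
# into tree objects. Must be run from inside a Git repository.
index_is_committable() {
  if git write-tree >/dev/null 2>&1; then
    echo "index is consistent"
  else
    echo "index cannot be written as a tree"
    return 1
  fi
}
```

Running it right after the git ls-files --stage shown above should take the failure branch, and after git add file1 it should take the success branch.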
[1] I think perhaps we should be able to do this, but it's currently forbidden by POSIX rules and none of the Unix-like file systems support it. It would make a mess of archivers like tar, too.
[2] This is not quite strictly true: for various speedup purposes, the index holds "irregular" (non-cache) entries as well as the normal cache entries that describe the proposed next commit. It's the cache entries that do not hold directory existence; the entries that aren't stuff-to-be-committed can hold all kinds of auxiliary information. But none of these are shown by git ls-files.