
Why does git remove an ignored file (but not an un-ignored file) when pulling a commit that untracked both files?


Consider the following situation:

  1. a remote repository creates and tracks two files
  2. the remote repository later on adds one of the two files to .gitignore (that's not correct, I know, but it happened in our organization)
  3. a local repository clones the remote one
  4. in the local repository, we untrack both files with git rm --cached
  5. the remote pulls those changes

I would expect the remote to still have both files, and to stop tracking them from now on.

That is not what happens, though (the actual outcome is listed after the script). Why?

Here is a bash script MWE that replicates what I mean

#! /bin/bash

# 1. Set up remote repository
mkdir remote

cd remote
git init .
touch file_to_remain.txt
touch file_to_remove.txt
touch file_to_ignore_and_remove.txt
git add .
git commit -m 'first commit'
echo "file_to_ignore_and_remove.txt" > .gitignore
git add .
git commit -m 'gitignore ignores a file that is already in the index'

# 2. clone local repo
cd ../
git clone ./remote local

# 3. untrack both files
cd local
git rm --cached file_to_ignore_and_remove.txt
git rm --cached file_to_remove.txt
git add .
git commit -m 'removed two files from index'

# 4. pull changes into remote
cd ../remote
git remote add origin "$(pwd)/../local"
git pull origin master

Instead, what happens is that:

  • both files are still present in the local repository
  • on the remote, the ignored file is deleted while the non-ignored file is still present.

An additional discovery: if I run git status before the commit in step 3 (in the MWE), file_to_remove.txt is shown as both deleted and untracked, while file_to_ignore_and_remove.txt is shown only as deleted. When I then run git add . only the deletion of file_to_ignore_and_remove.txt is recorded.


Solution

  • Your problem starts right in step 1, with this supposition:

    1. a remote repository creates and tracks two files

    A repository does not track files (nor not-track files). A Git repository consists of, mainly, a set of commits. Each commit contains a full and complete snapshot of all of the files that whoever made that commit told Git to include in that commit.

    What this means, before we get into the issue of tracked vs untracked at all, is that we can have a commit a123456 that contains files f1 and f2, another commit b56789a that contains files f2, f3, and secret, and a third commit cbcdef0 that contains files f3 and f1.

    After successfully checking out commit a123456, you'll find that you have files named f1 and f2, with whatever contents are in the snapshot in commit a123456. After successfully checking out commit cbcdef0, you'll find that you have files f1 and f3, with whatever contents are in the snapshot in commit cbcdef0. It doesn't matter what's in commit b56789a here because we never checked it out, even though the repository has that commit. We never notice the file named secret because we never look inside the commit that has that file.
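    If you want to see this for yourself, here is a rough sketch as a throwaway script. It builds three commits with the same file sets as the example above (the real hash IDs will differ from a123456 and friends; the sandbox directory and the identity are made up for the demo) and then lists what each snapshot holds, without checking anything out.

```shell
#!/bin/bash
# Hypothetical sandbox: three commits whose snapshots hold different file sets.
set -e
demo=$(mktemp -d)                          # throwaway directory
cd "$demo"
git init -q .
git config user.name "Demo User"           # made-up identity for this sandbox
git config user.email "demo@example.com"

touch f1 f2
git add f1 f2
git commit -q -m 'snapshot with f1 and f2'
first=$(git rev-parse HEAD)

git rm -q f1
touch f3 secret
git add f3 secret
git commit -q -m 'snapshot with f2, f3 and secret'
second=$(git rev-parse HEAD)

git rm -q f2 secret
touch f1
git add f1
git commit -q -m 'snapshot with f1 and f3'
third=$(git rev-parse HEAD)

# List the files stored in each snapshot, straight from the commits.
files_first=$(git ls-tree -r --name-only "$first" | paste -sd' ' -)
files_second=$(git ls-tree -r --name-only "$second" | paste -sd' ' -)
files_third=$(git ls-tree -r --name-only "$third" | paste -sd' ' -)
echo "first:  $files_first"
echo "second: $files_second"
echo "third:  $files_third"
```

    The file named secret shows up only in the listing for the middle commit; the other two snapshots know nothing about it.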

    Git works on a commit-by-commit basis. We pick some commit to work on or with, using git checkout or git switch.[1] Because all parts of any existing commit are entirely read-only, and the files inside each commit are stored in a special, Git-only, compressed and de-duplicated format that the other programs in your computer can't use, this "pick some commit to work with" step works by copying the files out of the commit, into a work area. The files inside the commit are not visible! They are private to Git itself. The files you work with are visible, and are ordinary files, but they were merely copied out of the files that Git actually uses (which are stored via Git objects, kept in a database inside Git).

    What this means is that the files you see and work with, when you use Git, are not actually in Git. That's important to keep in mind for the rest of this. Meanwhile, remember that Git's basic "unit of storage", as it were, is the commit. Each commit holds every file that Git knows about for that commit, by definition. But these files are in Git, and are merely going to be copied out when you ask.
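    A small sketch of that separation, assuming a scratch repository (the directory, identity, and file contents are invented for the demo): editing the ordinary file in the work area leaves Git's stored copy exactly as it was.

```shell
#!/bin/bash
# Hypothetical sandbox: the committed copy and the work-area copy are separate things.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"           # made-up identity for this sandbox
git config user.email "demo@example.com"

echo 'committed text' > f1
git add f1
git commit -q -m 'add f1'

# Edit the ordinary file in the work area; the commit is unaffected.
echo 'edited in the work area' > f1

in_commit=$(git cat-file -p HEAD:f1)       # Git's stored copy, read from its object database
in_worktree=$(cat f1)                      # the everyday file you can see and edit
echo "commit:    $in_commit"
echo "work area: $in_worktree"
```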


    [1] The new-in-Git-2.23 git switch implements the "safe" subset of git checkout, with the new-in-Git-2.23 git restore implementing the "unsafe" subset. If you always use Git 2.23 or later, and want to avoid certain tragic accidents, it's a good idea to train yourself to use the two new commands. The one old git checkout command continues to work, though, so if you're already self-trained to use only git checkout, you can continue to do that.


    More about commits

    Besides the facts that we already described or assumed above—that commits are numbered by those big ugly random-looking hash IDs you've seen, and that each commit has a full and complete (but read-only) snapshot of every file that goes with that commit, there's one more thing to know about commits: Each one contains some metadata, which, like all parts of a commit, is completely read-only: you can't change it after creating the commit.

    The metadata inside a commit includes things like the name and email address of the person who made the commit. It includes date-and-time-stamps (two, for various reasons). It includes a log message, in which whoever made the commit should explain why they made the commit, though the quality of this explanation is up to the commit author. Most importantly for Git itself, though, the metadata of any one commit will include a list of earlier commit hash IDs. These are the parents of the commit in question.
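    You can look at this metadata directly. The sketch below (scratch directory and identity are made up) dumps a raw commit object and pulls the parent line out of it; the sed expression is just one way to do that.

```shell
#!/bin/bash
# Hypothetical sandbox: inspect the metadata stored inside a commit.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"           # made-up identity for this sandbox
git config user.email "demo@example.com"

echo one > f1
git add f1
git commit -q -m 'first commit'
first=$(git rev-parse HEAD)

echo two > f2
git add f2
git commit -q -m 'second commit'

# Raw commit object: tree, parent, author, committer, then the log message.
git cat-file -p HEAD

# The parent hash ID recorded in the second commit's metadata.
parent_in_metadata=$(git cat-file -p HEAD | sed -n 's/^parent //p')
echo "parent recorded in metadata: $parent_in_metadata"
echo "hash ID of the first commit: $first"
```

    The author and committer lines each carry their own timestamp, which is where the two date-and-time stamps come from.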

    Most commits have just one parent. These are ordinary (non-merge) commits. They are the history in a Git repository. Adding a new commit, or many new commits, to some Git repository, while keeping the existing ones, is how we add history. We can do this one commit at a time—by making a commit ourselves—or en masse, by fetching many commits from some other Git repository. The secret that makes all this work has to do with those big ugly hash IDs. We won't cover that properly here; we'll just say that every Git uses the same cryptographic hash function to compute the hash IDs, so that all Gits agree that any particular commit gets its particular hash ID.

    In any case, Git arranges for each new commit to remember the hash ID of its immediately previous commit. This means that ordinary (non-merge) commits form a simple backwards-looking chain of commits:

    ... <-F <-G <-H
    

    Here, we've replaced each actual hash ID with a single uppercase letter, and drawn commits with the latest one on the right. We've called its hash ID H. Git can find this commit, by hash ID, because it's stored in the database of all Git objects, indexed by hash IDs. That lets Git get at the stored snapshot, and also at the metadata.

    Note, though, that commit H's metadata includes the hash ID of earlier commit G. So Git can find commit G, which lets Git get at G's stored snapshot. By comparing the stored snapshots—G vs H—Git can tell us what changed between the two commits. This is where things will get interesting.

    Of course, since G is an ordinary commit, with one parent, G has F's hash ID stored in its metadata. That means Git can walk back from G to F too, and compare the two snapshots. Meanwhile F stores in its metadata, its parent hash ID, so Git can walk back another step as needed. This repeats all the way to the very first commit—the beginning of history—where that first commit simply has no parent at all, to tell Git: This is the beginning. That's where git log, for instance, will stop, having run out of commits.
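    Here is a sketch of that backwards walk, with three commits standing in for F, G, and H (the file names a and b and the sandbox itself are invented). Comparing each commit against its parent is what turns two snapshots into a list of changes.

```shell
#!/bin/bash
# Hypothetical sandbox: walk a three-commit chain and compare adjacent snapshots.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"           # made-up identity for this sandbox
git config user.email "demo@example.com"

echo 'file a' > a
git add a
git commit -q -m 'F: add a'

echo 'file b' > b
git add b
git commit -q -m 'G: add b'

git rm -q a
git commit -q -m 'H: remove a'

history_length=$(git rev-list --count HEAD)      # commits reachable by walking parents from H
g_to_h=$(git diff --name-status HEAD~1 HEAD)     # snapshot G vs snapshot H
f_to_g=$(git diff --name-status HEAD~2 HEAD~1)   # snapshot F vs snapshot G
echo "commits in chain: $history_length"
echo "G -> H: $g_to_h"
echo "F -> G: $f_to_g"
```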

    Ultimately, this is also how branch names work. We won't go into the details here, but each branch name just holds one hash ID. That one hash ID is the ID of the commit that we wish to call the last commit in some chain of commits. Even if there are more commits after this point, that particular marked commit is the end of this chain. Because of this, Git ends up working backwards all the time: it starts with the end, and works back through history to the beginning.

    Remember that none of this stuff inside commits can be changed. Branch names can be changed: you can rename a branch if you want, for instance, but more importantly, you can stuff a different hash ID into a branch name. When you do that, the commit that's the end of the chain has changed. This is how branches grow, or—if needed—shrink: by moving the branch name to new commits that we add, or by moving the branch name backwards, to some historical commit.

    Git's index and your working tree

    Now we come to the issue of actually working with a commit. As we noted before, everything inside a commit—both the snapshot of all files, and the metadata—is read-only. Nothing can change it, not even Git (because of the hash trick that makes Git work as a distributed system). But to make use of the files in a commit, we definitely need to be able to read them—which we can't when they are in the object-ized and de-duplicated internal Git format—and we almost certainly need to be able to write to them too.

    This is why Git copies the files out of the commit. The copies go into a working area, which Git calls your working tree or work-tree. As we saw earlier, these are the files you can see and work with. They are literally just ordinary everyday files. Git does not control this work area! Git does—usually—make it, initially, on git clone, but you make it, initially, if you use git init. This work area is now yours to do with as you see fit. Just remember that git checkout is a request for Git to fill your work area with files, extracted from some commit.

    Note that this means that there must necessarily be at least two active copies of your various files:

    • there is the frozen one in the current commit, that Git took out and put in your work-tree for you; and
    • there is the one in your work-tree, that you're working on / with.

    Git could stop here, with a work area full of files, and with commits. Some other version control systems do that. But this isn't what Git does. Instead, Git squeezes, in between its frozen committed copy of each file and your work-tree copy, a third copy. This means that if you have those files f1 and f2 in commit a123456 and that's your current commit, Git will have:

    • a frozen copy of each file, f1 and f2, in its commit;
    • another "copy" of f1 and f2 ready to go into the next commit; and
    • the usable copy of f1 and f2 in your work-tree.

    This middle "copy"—in quotes here, because it's in Git's internal format, which de-duplicates files, so it initially literally just shares the originals from the commit—of each file lives in an area that Git gives three names. Git calls this the index, or the staging area, or—rarely these days—the cache. The last name mostly shows up in flags, like git rm --cached.

    What's special about the index copy of each file is that Git will let you replace it. The copy in the commit can't be replaced, because nothing about any existing commit can be changed. But the index is merely a proposed commit. It's not actually a commit yet. So what's in Git's index can be changed.
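    To make the "replaceable copy" concrete, here is a minimal sketch in a scratch repository (names and contents are made up). It compares the blob hash ID recorded in the commit with the one recorded in the index, before and after a git add.

```shell
#!/bin/bash
# Hypothetical sandbox: the index copy can be replaced, the committed copy cannot.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"           # made-up identity for this sandbox
git config user.email "demo@example.com"

echo 'version 1' > f1
git add f1
git commit -q -m 'add f1'

committed_blob=$(git rev-parse HEAD:f1)    # blob stored in the current commit
index_blob_before=$(git rev-parse :f1)     # ":f1" names the index copy of f1

echo 'version 2' > f1
git add f1                                 # replaces the index copy only
index_blob_after=$(git rev-parse :f1)
committed_blob_after=$(git rev-parse HEAD:f1)

git ls-files --stage                       # mode, blob hash ID, stage number, file name
```

    Right after the commit, the index entry and the committed file share one blob. After git add, the index points at a new blob, while HEAD:f1 still resolves to the old one.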

    This is what git add does. This is also what git rm --cached does: it changes the proposed next commit. Changing a proposed commit doesn't affect any existing commit, so that's OK. Git achieves this change by doing one of three things:

    • replace some existing file: overwrite the index f1 with a new version;[2]
    • add a new-to-the-proposed commit file: create a new index entry for a file we didn't have in the proposed commit; or
    • remove a file from the proposed next commit.

    All of these changes, then, happen in Git's index. This means that the proposed next commit is always up to date, and when you run git commit to make that commit, Git just has to snapshot whatever is in Git's index.
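    The three kinds of index update can be sketched in one go (scratch repository, made-up file names). The listing taken from the index just before committing should match the listing taken from the new commit just after.

```shell
#!/bin/bash
# Hypothetical sandbox: replace, add, and remove index entries, then commit the index.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"           # made-up identity for this sandbox
git config user.email "demo@example.com"

echo 'old' > f1
echo 'keep me on disk' > f2
git add f1 f2
git commit -q -m 'f1 and f2'

echo 'new' > f1
git add f1                 # 1. replace an existing index entry
echo 'brand new' > f3
git add f3                 # 2. add a new index entry
git rm -q --cached f2      # 3. remove an index entry, leaving the work-tree file alone

proposed=$(git ls-files | paste -sd' ' -)                      # the proposed next commit
git commit -q -m 'snapshot of the index'
committed=$(git ls-tree -r --name-only HEAD | paste -sd' ' -)  # what the commit captured
echo "proposed:  $proposed"
echo "committed: $committed"
ls                                                             # f2 is still in the work-tree
```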

    This leads to the definition of a tracked file. Technically, Git just defines the term untracked file, but it's pretty obvious how to invert that.


    [2] Since the old file version content is shared, this doesn't actually overwrite it, but rather just makes a new one, or finds some other existing copy to de-duplicate. The actual mechanism behind this uses the same object hash ID tricks that Git uses for commits. Commits always get a unique hash ID, because something about each commit is always guaranteed to be unique. (There's a lot of magic behind this, and the date isn't actually required here, but one way Git guarantees this is that every commit gets "now" as a date-and-time stamp. You'd have to make multiple commits every second for this part to be the same.) File contents, if they duplicate some earlier saved version, will just wind up using the old version's hash ID, and hence be automatically de-duplicated.


    Tracked vs untracked files

    Once you know how Git uses its index, as the proposed next commit, the definition of a tracked or untracked file becomes almost trivial: A tracked file is one that is in Git's index.

    That's all there is to it, really. An untracked file is a file you have in your working tree right now, that is not in Git's index right now. Put that file in Git's index—with git add, for instance—and it becomes a tracked file. Remove it from Git's index—with git rm --cached, for instance—and it becomes an untracked file. You can run git add or git rm --cached whenever you like, so you can convert a file from tracked to untracked, or vice versa, whenever you like.
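    As a quick illustration (scratch repository; notes.txt is an invented file name), the same file can flip between untracked and tracked as often as you like, with nothing else about it changing:

```shell
#!/bin/bash
# Hypothetical sandbox: a file is tracked exactly when it is in the index.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"           # made-up identity for this sandbox
git config user.email "demo@example.com"

touch f1
git add f1
git commit -q -m 'add f1'

touch notes.txt
untracked_at_start=$(git ls-files --others)          # in the work-tree, not in the index

git add notes.txt                                    # now in the index: tracked
tracked_after_add=$(git ls-files | paste -sd' ' -)

git rm -q --cached notes.txt                         # out of the index again: untracked
untracked_again=$(git ls-files --others)

echo "untracked at start: $untracked_at_start"
echo "tracked after add:  $tracked_after_add"
echo "untracked again:    $untracked_again"
```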

    But there's a big, and rather hairy, wrinkle here. When you run git checkout to pick a commit to use, Git will:

    1. fill in Git's index from the commit; and then
    2. fill in your work-tree from Git's index.

    Suppose you're on commit a123456, which has files f1 and f2. You got there in a normal everyday way and you now have files f1 and f2 in your working tree. Git has f1 and f2 in its index. All three copies of each of these two files match, so it's quite safe to move from a123456 to, say, cbcdef0. So you run git checkout on a branch name that identifies commit cbcdef0, for instance, to switch to that commit.

    Commit cbcdef0 says that we should have files named f1 and f3. Git's index currently has f1 and f2 in it. To make Git's index hold f1 and f3, Git must remove f2 from the index. Because f2 is a tracked file—it's in the index—Git will also remove f2 from your working tree. Git can put the right copies of f1 and f3 in its index and your working tree, and the checkout is finished and file f2 is gone.

    But it's not really gone, is it? It's perfectly safe, in commit a123456. Just git checkout that commit and file f3 will vanish—it exists now and is tracked but shouldn't exist because a123456 lacks file f3—and file f2 will come back, extracted from a123456, now in both Git's index and your working tree.

    Note that you can, if you like, run git rm --cached f3 right now, before switching to a123456. That removes f3 from Git's index. Now f3 is an untracked file. Now you can git checkout commit a123456, and Git won't remove f3, because f3 is not in Git's index. The fact that you have a file f3 lying around in your working tree: well, that's your business. It's not in Git's index, so it's an untracked file: one of yours, left there for whatever purpose, and not for Git to bother with.

    Your local and "remote" repositories

    But now, in your example, you've thrown another repository-and-work-tree combination into the mix. You now have your local repository, in which you're running git rm --cached and git add and making new commits. That's all fine! But you also have, somewhere—on your own machine, or on some other machine—another Git repository. That Git repository has its own index and its own working tree.

    If you are on some commit, and remove some file with git rm --cached, it's now gone from Git's index, but still in your working tree. You make the new commit, which lacks the file, and all is still fine.

    But now, over in the other Git repository, you do something that obtains this new commit. You still have the old commit checked out, and that has some files that are in both Git's index and your working tree here, on this machine in this other repository. Now you tell Git: switch to some other commit and the other commit lacks the file. The file is in Git's index here—this index is part of this repository—so the file is tracked here, so Git removes the file, just as it should.

    You will always face this problem in other Git repositories that have commits checked out, if you use this kind of git rm --cached trick to remove a file from Git's index and make a new commit while keeping the file in your working tree. That's because their index and their Git are never told to keep the file. All they see is "old commit has file, new commit lacks file": that's an instruction to delete the file. It's not gone from the repository, but it is gone from the working tree.
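    Here is a stripped-down sketch of that situation with two scratch repositories (the names first and second, and the file settings.txt, are invented; first plays the role of the repository that does the pull). It asks the clone for its branch name rather than assuming master.

```shell
#!/bin/bash
# Hypothetical sandbox: an "untracking" commit deletes the file in another repository.
set -e
sandbox=$(mktemp -d)
cd "$sandbox"

# "first" is the repository that will later pull.
git init -q first
git -C first config user.name "Demo User"          # made-up identity for this sandbox
git -C first config user.email "demo@example.com"
echo 'local settings' > first/settings.txt
git -C first add settings.txt
git -C first commit -q -m 'track settings.txt'

# "second" is a clone where someone untracks the file but keeps it on disk.
git clone -q first second
git -C second config user.name "Demo User"
git -C second config user.email "demo@example.com"
git -C second rm -q --cached settings.txt
git -C second commit -q -m 'stop tracking settings.txt'
branch=$(git -C second symbolic-ref --short HEAD)

# Back in "first": the old commit has the file, the new commit lacks it.
git -C first pull -q ../second "$branch"

if [ -e first/settings.txt ]; then in_first='present'; else in_first='deleted'; fi
if [ -e second/settings.txt ]; then in_second='present'; else in_second='deleted'; fi
echo "first:  settings.txt is $in_first"
echo "second: settings.txt is $in_second"
```

    The file survives only in the repository where git rm --cached was run. Its content is still retrievable from the older commit in either repository.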

    About .gitignore

    The .gitignore file is misnamed. Git will build a new commit from whatever is in Git's index. That's the proposed next commit. If a file is in Git's index, it is in Git's index, whether or not the name or pattern is in a .gitignore file. Hence this:

    2. the remote repository later on adds one of the two files to .gitignore (that's not correct, I know, but it happened in our organization)

    isn't immediately harmful. Later, it can be; it's not exactly wrong, but I think it's bad practice myself.

    The .gitignore file has several functions. The most important ones are the ones that affect people doing new work.

    When you run git status, your Git:

    1. prints some generally useful stuff, like on branch xyzzy;
    2. may or may not print something about files staged for commit;
    3. may or may not print something about files not staged for commit; and
    4. may or may not print stuff about untracked files.

    The list of file names printed in step 2 is a result of comparing the current commit to the proposed next commit. For each file that is the same, Git says nothing at all. For each file that is different in any way, Git says that this file is staged for commit, along with a status-code letter: M for modified (file exists in both HEAD commit and index / staging-area, with different content), D for deleted (file exists in HEAD but not in index), A for added, and so on.[3]

    The list of names printed in step 3 is a result of comparing the proposed next commit to your working tree. That is, Git diffs the files in the index vs the files in your work-tree. For those that are the same, Git says nothing at all. For those that are different, Git prints much the same as before, but now calls these not staged for commit. You can run git add on these to copy the work-tree version of the file into the index.

    What's a bit peculiar here is that instead of saying that some files are Added, Git gathers up all those file names and then shuffles them off into step 4. These are your untracked files. Git now whines about them, implying that you ought to use git add to add them.
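    The comparisons behind steps 2 through 4 can be run by hand. This is only a sketch in a scratch repository with invented file names, and the three commands are my approximation of what git status summarizes, not a claim about how it is implemented internally.

```shell
#!/bin/bash
# Hypothetical sandbox: the three comparisons that git status reports on.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"           # made-up identity for this sandbox
git config user.email "demo@example.com"

echo 'one' > f1
echo 'two' > f2
git add f1 f2
git commit -q -m 'f1 and f2'

echo 'one, edited and staged' > f1
git add f1
echo 'two, edited but not staged' > f2
touch new.txt

staged=$(git diff --cached --name-status)              # HEAD commit vs index
not_staged=$(git diff --name-status)                   # index vs work-tree
untracked=$(git ls-files --others --exclude-standard)  # in work-tree, not in index, not ignored
echo "staged for commit:     $staged"
echo "not staged for commit: $not_staged"
echo "untracked:             $untracked"
git status --short
```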

    In many setups, there will be many untracked files that definitely shouldn't be git added, because they should not be in the next commit. Adding them puts them into the proposed next commit, which means now you have to go take them out (git rm --cached) so that they don't actually go into the commit when you do make it. We'd like to get git status to just shut up about this.

    So, we list these files in .gitignore. That means that instead of .gitignore, perhaps this file should be named .git-do-not-whine-about-these-untracked-files-in-git-status-output.

    But we also have an easy way of git add-ing all files: we run git add . or git add * or git add -A or whatever. This tells Git to add all files, en masse, in one swell foop. But there are files we don't want to be added: those same files we had git status shut up about. So the file should be called .git-do-not-whine-about-these-untracked-files-in-git-status-output-and-do-not-auto-add-them-when-I-use-an-en-masse-style-git-add-operation, or something like that.

    The files aren't literally ignored, because any file that's in Git's index will be in the next commit. So .gitignore is clearly the wrong name. But the right name is way too long and ridiculous to type in: we might as well call this file .gitignore.
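    This also accounts for the observation at the end of the question. The sketch below (scratch repository; tracked.txt and ignored.txt are stand-ins for the question's two files) untracks both files and then runs an en-masse git add, the way step 3 of the MWE does.

```shell
#!/bin/bash
# Hypothetical sandbox: .gitignore does not untrack anything, but "git add ." honors it.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"           # made-up identity for this sandbox
git config user.email "demo@example.com"

touch tracked.txt ignored.txt
git add tracked.txt ignored.txt
git commit -q -m 'track both files'
echo 'ignored.txt' > .gitignore
git add .gitignore
git commit -q -m 'ignore a file that is already tracked'

# The ignore rule changes nothing here: both files are still in the index.
still_tracked=$(git ls-files | paste -sd' ' -)

git rm -q --cached tracked.txt ignored.txt
git add .      # en-masse add: picks tracked.txt back up, skips ignored.txt
staged=$(git diff --cached --name-status)

echo "tracked before untracking: $still_tracked"
echo "staged after git add .:    $staged"
```

    Only the removal of the ignored file is left in the proposed next commit, so only that file is missing from the new commit, and only that file gets deleted in a repository that later moves to that commit.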

    There's one other thing that .gitignore does, though, and it's where listing a tracked file (specifically, any file that will become a tracked file upon checking out some commit) can be dangerous. What the .gitignore listing does is give Git permission, in some circumstances,[4] to overwrite or remove the file, clobbering data that you might not have saved anywhere. This is why I don't like the situation where you have a file that's tracked, but also matches a .gitignore pattern: it sets you up to lose data later by accident.

    (This means the full correct name might really be .git-do-not-whine-about-these-untracked-files-in-git-status-output-and-do-not-auto-add-them-when-I-use-an-en-masse-style-git-add-operation-but-do-feel-free-to-clobber-them. That's ... even more ridiculous.)
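    One circumstance where I'm fairly confident this happens is a plain git checkout with default settings. The sketch below uses a scratch repository with invented file names (local.cfg is ignored, other.cfg is not) and contrasts the two outcomes. Treat it as an illustration of the default behavior as I understand it, not as a statement about every Git version or configuration.

```shell
#!/bin/bash
# Hypothetical sandbox: an ignored untracked file can be clobbered by checkout.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"           # made-up identity for this sandbox
git config user.email "demo@example.com"

echo 'local.cfg' > .gitignore
git add .gitignore
git commit -q -m 'ignore local.cfg'
base=$(git rev-parse HEAD)

echo 'from the commit' > local.cfg
echo 'from the commit' > other.cfg
git add -f local.cfg       # -f: add the file despite the ignore rule
git add other.cfg
git commit -q -m 'commit both config files anyway'
with_cfg=$(git rev-parse HEAD)

git checkout -q "$base"    # both files are tracked at this point, so Git removes them

# An untracked file that is NOT ignored blocks the checkout.
echo 'my own notes' > other.cfg
if git checkout -q "$with_cfg" 2>/dev/null; then blocked='no'; else blocked='yes'; fi
rm other.cfg

# An untracked file that IS ignored gets overwritten without complaint.
echo 'my precious settings' > local.cfg
git checkout -q "$with_cfg"
clobbered=$(cat local.cfg)

echo "checkout blocked by unignored file: $blocked"
echo "local.cfg now contains: $clobbered"
```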


    [3] When you're in the middle of a conflicted merge, all of this changes. The git status documentation is being improved for the next release of Git, to try to help make more sense of the status code letters. I won't go into further detail here though.

    [4] For a very long and wide-ranging discussion of this problem, see, e.g., this thread from the Git mailing list archives. The usual egregious case (wiping out a local configuration) occurs when you git merge a commit that has a configuration file in it, that is not in your current commit, but is present and is listed in a .gitignore file.