Consider the following situation:
git rm --cached
I would expect the remote to still have both files, and to stop tracking them from now on.
Why does this happen?
Here is a bash script MWE that replicates what I mean
#! /bin/bash
# 1. Set up remote repository
mkdir remote
cd remote
git init .
touch file_to_remain.txt
touch file_to_remove.txt
touch file_to_ignore_and_remove.txt
git add .
git commit -m 'first commit'
echo "file_to_ignore_and_remove.txt" > .gitignore
git add .
git commit -m 'gitignore ignores a file that is already in the index'
# 2. clone local repo
cd ../
git clone ./remote local
# 3. untrack both files
cd local
git rm --cached file_to_ignore_and_remove.txt
git rm --cached file_to_remove.txt
git add .
git commit -m 'removed two files from index'
# 4. pull changes into remote
cd ../remote
git remote add origin `pwd`/../local
git pull origin master
Instead, what happens is that:
An additional discovery: if I run git status
before the commit in stage 3 (in the MWE), the file_to_remove.txt
is shown as both deleted and untracked, while file_to_ignore_and_remove.txt
is shown only as deleted. When I do a git add .
only the deletion of file_to_ignore_and_remove.txt
is recorded.
Your problem starts right in step 1, with this supposition:
- a remote repository creates and tracks two files
A repository does not track files (nor not-track files). A Git repository consists of, mainly, a set of commits. Each commit contains a full and complete snapshot of all of the files that whoever made that commit told Git to include in that commit.
What this means, before we get into the issue of tracked vs untracked at all, is that we can have a commit a123456
that contains files f1
and f2
, another commit b56789a
that contains files f2
, f3
, and secret
, and a third commit cbcdef0
that contains files f3
and f1
.
After successfully checking out commit a123456
, you'll find that you have files named f1
and f2
, with whatever contents are in the snapshot in commit a123456
. After successfully checking out commit cbcdef0
, you'll find that you have files f1
and f3
, with whatever contents are in the snapshot in commit cbcdef0
. It doesn't matter what's in commit b56789a
here because we never checked it out, even though the repository has that commit. We never notice the file named secret
because we never look inside the commit that has that file.
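A minimal sketch of the three-commit picture above, assuming a throwaway directory from mktemp and a demo identity set through environment variables (both are illustrative, not part of the original question). The real hash IDs will differ from a123456 and cbcdef0, so the script stores them in the variables first and third:

```shell
#!/bin/bash
set -eu
# Demo identity so that "git commit" works in a clean environment (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)          # hypothetical scratch directory
cd "$demo"
git init -q .

# First commit: the snapshot holds f1 and f2.
echo one > f1; echo two > f2
git add f1 f2
git commit -q -m 'f1 and f2'
first=$(git rev-parse HEAD)

# Second commit: the snapshot holds f2, f3 and secret.
git rm -q f1
echo three > f3; echo hidden > secret
git add f3 secret
git commit -q -m 'f2, f3 and secret'

# Third commit: the snapshot holds f1 and f3.
git rm -q f2 secret
echo one > f1
git add f1
git commit -q -m 'f1 and f3'
third=$(git rev-parse HEAD)

# Check out the first and third commits; the second is never checked out,
# so the file named secret never shows up in the working tree here.
git checkout -q "$first"
files_first=$(echo *)
git checkout -q "$third"
files_third=$(echo *)
echo "first commit: $files_first"
echo "third commit: $files_third"
```

The middle commit still exists in the repository the whole time; it just is not the one being looked at.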
Git works on a commit-by-commit basis. We pick some commit to work on or with, using git checkout
or git switch
.1 Because all parts of any existing commit are entirely read-only, and the files inside each commit are stored in a special, Git-only, compressed and de-duplicated format that the other programs in your computer can't use, this "pick some commit to work with" step works by copying the files out of the commit, into a work area. The files inside the commit are not visible! They are private to Git itself. The files you work with are visible, and are ordinary files, but they were merely copied out of the files that Git actually uses (which are stored via Git objects, kept in a database inside Git).
What this means is that the files you see and work with, when you use Git, are not actually in Git. That's important to keep in mind for the rest of this. Meanwhile, remember that Git's basic "unit of storage", as it were, is the commit. Each commit holds every file that Git knows about for that commit, by definition. But these files are in Git, and are merely going to be copied out when you ask.
1The new-in-Git-2.23 git switch
implements the "safe" subset of git checkout
, with the new-in-Git-2.23 git restore
implementing the "unsafe" subset. If you always use Git 2.23 or later, and want to avoid certain tragic accidents, it's a good idea to train yourself to use the two new commands. The one old git checkout
command continues to work, though, so if you're already self-trained to use only git checkout
, you can continue to do that.
Besides the facts that we already described or assumed above (that commits are numbered by those big ugly random-looking hash IDs you've seen, and that each commit has a full and complete, but read-only, snapshot of every file that goes with that commit), there's one more thing to know about commits: each one contains some metadata, which, like all parts of a commit, is completely read-only. You can't change it after creating the commit.
The metadata inside a commit includes things like the name and email address of the person who made the commit. It includes date-and-time-stamps (two, for various reasons). It includes a log message, in which whoever made the commit should explain why they made the commit, though the quality of this explanation is up to the commit author. Most importantly for Git itself, though, the metadata of any one commit will include a list of earlier commit hash IDs. These are the parents of the commit in question.
Most commits have just one parent. These are ordinary (non-merge) commits. They are the history in a Git repository. Adding a new commit, or many new commits, to some Git repository, while keeping the existing ones, is how we add history. We can do this one commit at a time—by making a commit ourselves—or en masse, by fetching many commits from some other Git repository. The secret that makes all this work has to do with those big ugly hash IDs. We won't cover that properly here; we'll just say that every Git uses the same cryptographic hash function to compute the hash IDs, so that all Gits agree that any particular commit gets its particular hash ID.
In any case, Git arranges for each new commit to remember the hash ID of its immediately previous commit. This means that ordinary (non-merge) commits form a simple backwards-looking chain of commits:
... <-F <-G <-H
Here, we've replaced each actual hash ID with a single uppercase letter, and drawn commits with the latest one on the right. We've called its hash ID H
. Git can find this commit, by hash ID, because it's stored in the database of all Git objects, indexed by hash IDs. That lets Git get at the stored snapshot, and also at the metadata.
Note, though, that commit H
's metadata includes the hash ID of earlier commit G
. So Git can find commit G
, which lets Git get at G
's stored snapshot. By comparing the stored snapshots—G
vs H
—Git can tell us what changed between the two commits. This is where things will get interesting.
Of course, since G
is an ordinary commit, with one parent, G
has F
's hash ID stored in its metadata. That means Git can walk back from G
to F
too, and compare the two snapshots. Meanwhile F
stores its parent's hash ID in its metadata, so Git can walk back another step as needed. This repeats all the way to the very first commit, the beginning of history, which simply has no parent at all. That tells Git: this is the beginning. That's where git log
, for instance, will stop, having run out of commits.
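The backwards-looking chain can be inspected directly. This is a sketch under the same assumptions as any scratch experiment (temporary directory, demo identity); the helper make_commit and the commit messages F, G and H are made up for illustration:

```shell
#!/bin/bash
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)          # hypothetical scratch directory
cd "$demo"
git init -q .

# Hypothetical helper: make one commit and print its hash ID.
make_commit() {
    echo "$1" > file.txt
    git add file.txt
    git commit -q -m "$1"
    git rev-parse HEAD
}

f=$(make_commit F)
g=$(make_commit G)
h=$(make_commit H)

# The raw commit object for H: tree, parent, author, committer, message.
git cat-file -p "$h"

# The "parent" line in H's metadata is G's hash ID.
parent_line=$(git cat-file -p "$h" | grep '^parent ')

# Walking the chain from H visits H, then G, then F, and stops.
chain=$(git rev-list "$h" | xargs)

# The very first commit has no parent line at all.
root_parents=$(git cat-file -p "$f" | grep -c '^parent ' || true)
echo "H records: $parent_line"
echo "chain: $chain"
echo "parent lines in F: $root_parents"
```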
Ultimately, this is also how branch names work. We won't go into the details here, but each branch name just holds one hash ID. That one hash ID is the ID of the commit that we wish to call the last commit in some chain of commits. Even if there are more commits after this point, that particular marked commit is the end of this chain. Because of this, Git ends up working backwards all the time: it starts with the end, and works back through history to the beginning.
Remember that none of this stuff inside commits can be changed. Branch names can be changed: you can rename a branch if you want, for instance, but more importantly, you can stuff a different hash ID into a branch name. When you do that, the commit that's the end of the chain has changed. This is how branches grow, or—if needed—shrink: by moving the branch name to new commits that we add, or by moving the branch name backwards, to some historical commit.
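As a rough illustration of a branch name being nothing more than a movable pointer, the following sketch moves a name backwards and then forwards again. It assumes a scratch repository and uses git reset --hard as one of several ways to move the current branch name; the commit contents are placeholders:

```shell
#!/bin/bash
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)          # hypothetical scratch directory
cd "$demo"
git init -q .

for msg in F G H; do
    echo "$msg" > file.txt
    git add file.txt
    git commit -q -m "$msg"
done

# Whatever the default branch is called here (master, main, ...).
branch=$(git symbolic-ref --short HEAD)

# The branch name holds exactly one hash ID: that of commit H.
tip_before=$(git rev-parse "$branch")

# Shrink the branch: point the name at G instead.
git reset -q --hard 'HEAD~1'
tip_after=$(git rev-parse "$branch")

# Commit H itself is untouched; only the name moved.
old_tip_type=$(git cat-file -t "$tip_before")

# Grow the branch again by pointing the name back at H.
git reset -q --hard "$tip_before"
tip_restored=$(git rev-parse "$branch")
echo "$branch: $tip_before -> $tip_after -> $tip_restored"
```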
Now we come to the issue of actually working with a commit. As we noted before, everything inside a commit—both the snapshot of all files, and the metadata—is read-only. Nothing can change it, not even Git (because of the hash trick that makes Git work as a distributed system). But to make use of the files in a commit, we definitely need to be able to read them—which we can't when they are in the object-ized and de-duplicated internal Git format—and we almost certainly need to be able to write to them too.
This is why Git copies the files out of the commit. The copies go into a working area, which Git calls your working tree or work-tree. As we saw earlier, these are the files you can see and work with. They are literally just ordinary everyday files. Git does not control this work area! Git does—usually—make it, initially, on git clone
, but you make it, initially, if you use git init
. This work area is now yours to do with as you see fit. Just remember that git checkout
is a request for Git to fill your work area with files, extracted from some commit.
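Note that this means that there must necessarily be at least two active copies of each of your files: the frozen, committed copy that Git keeps inside the current commit, and the ordinary, usable copy in your work-tree.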
Git could stop here, with a work area full of files, and with commits. Some other version control systems do that. But this isn't what Git does. Instead, Git squeezes, in between its frozen committed copy of each file and your work-tree copy, a third copy. This means that if you have those files f1
and f2
in commit a123456
and that's your current commit, Git will have:
- f1 and f2, in its commit;
- f1 and f2, ready to go into the next commit; and
- f1 and f2, in your work-tree.
This middle "copy" of each file lives in an area that Git gives three names. ("Copy" is in quotes here because it's in Git's internal format, which de-duplicates files, so it initially just shares the originals from the commit.) Git calls this the index, or the staging area, or (rarely these days) the cache. The last name mostly shows up in flags, like git rm --cached
.
What's special about the index copy of each file is that Git will let you replace it. The copy in the commit can't be replaced, because nothing about any existing commit can be changed. But the index is merely a proposed commit. It's not actually a commit yet. So what's in Git's index can be changed.
This is what git add
does. This is also what git rm --cached
does: it changes the proposed next commit. Changing a proposed commit doesn't affect any existing commit, so that's OK. Git achieves this change by doing one of three things:
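- replacing the index copy of a file such as f1 with a new version;2
- adding an all-new file to the index; or
- removing a file from the index entirely.
All of these changes, then, happen in Git's index. This means that the proposed next commit is always up to date, and when you run git commit
to make that commit, Git just has to snapshot whatever is in Git's index.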
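One way to see that a new commit is just a snapshot of the index is to ask Git which tree the index would produce, and compare that with the tree of the commit made right afterwards. This is a sketch in a scratch repository; git write-tree is a plumbing command used here only for inspection:

```shell
#!/bin/bash
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)          # hypothetical scratch directory
cd "$demo"
git init -q .

echo v1 > f1
git add f1
git commit -q -m 'first'

echo v2 > f1               # changes the work-tree copy only
git add f1                 # replaces the index copy of f1
echo new > f2
git add f2                 # adds an all-new entry to the index

# Show what is in the index right now: mode, blob hash, stage, name.
git ls-files --stage

# The tree object that the index, as it stands, would become.
proposed=$(git write-tree)

git commit -q -m 'second'
committed=$(git rev-parse 'HEAD^{tree}')
echo "proposed:  $proposed"
echo "committed: $committed"
```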
This leads to the definition of a tracked file. Technically, Git just defines the term untracked file, but it's pretty obvious how to invert that.
2Since the old file version content is shared, this doesn't actually overwrite it, but rather just makes a new one, or finds some other existing copy to de-duplicate. The actual mechanism behind this uses the same object hash ID tricks that Git uses for commits. Commits always get a unique hash ID, because something about each commit is always guaranteed to be unique. (There's a lot of magic behind this, and the date isn't actually required here, but one way Git guarantees this is that every commit gets "now" as a date-and-time stamp. You'd have to make multiple commits every second for this part to be the same.) File contents, if they duplicate some earlier saved version, will just wind up using the old version's hash ID, and hence be automatically de-duplicated.
Once you know how Git uses its index, as the proposed next commit, the definition of a tracked or untracked file becomes almost trivial: A tracked file is one that is in Git's index.
That's all there is to it, really. An untracked file is a file you have in your working tree right now, that is not in Git's index right now. Put that file in Git's index—with git add
, for instance—and it becomes a tracked file. Remove it from Git's index—with git rm --cached
, for instance—and it becomes an untracked file. You can run git add
or git rm --cached
whenever you like, so you can convert a file from tracked to untracked, or vice versa, whenever you like.
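The tracked/untracked flip can be watched with git ls-files, which lists what is in the index. A small sketch, assuming a scratch repository and a placeholder file name notes.txt:

```shell
#!/bin/bash
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)          # hypothetical scratch directory
cd "$demo"
git init -q .

echo data > notes.txt
git add notes.txt
git commit -q -m 'add notes'

# In the index: tracked.
tracked_before=$(git ls-files notes.txt)

# Remove it from the index only; the work-tree file stays put.
git rm -q --cached notes.txt
tracked_after=$(git ls-files notes.txt)
untracked=$(git ls-files --others notes.txt)

# Put it back in the index: tracked again.
git add notes.txt
tracked_again=$(git ls-files notes.txt)

echo "before: '$tracked_before'  after rm --cached: '$tracked_after'"
echo "listed as untracked meanwhile: '$untracked'"
echo "after git add: '$tracked_again'"
```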
But there's a big, and rather hairy, wrinkle here. When you run git checkout
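to pick a commit to use, Git will:
- remove, from both its index and your working tree, any tracked file that is not in the commit you are moving to; and
- copy, into both its index and your working tree, the files that are in that commit.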
Suppose you're on commit a123456
, which has files f1
and f2
. You got there in a normal everyday way and you now have files f1
and f2
in your working tree. Git has f1
and f2
in its index. All three copies of each of these two files match, so it's quite safe to move from a123456
to, say, cbcdef0
. So you run git checkout
on a branch name that identifies commit cbcdef0
, for instance, to switch to that commit.
Commit cbcdef0
says that we should have files named f1
and f3
. Git's index currently has f1
and f2
in it. To make Git's index hold f1
and f3
, Git must remove f2
from the index. Because f2
is a tracked file—it's in the index—Git will also remove f2
from your working tree. Git can put the right copies of f1
and f3
in its index and your working tree, and the checkout is finished and file f2
is gone.
But it's not really gone, is it? It's perfectly safe, in commit a123456
. Just git checkout
that commit and file f3
will vanish—it exists now and is tracked but shouldn't exist because a123456
lacks file f3
—and file f2
will come back, extracted from a123456
, now in both Git's index and your working tree.
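The f2-vanishes-and-returns behavior can be reproduced in a scratch repository. In this sketch the variables a and c stand in for a123456 and cbcdef0, and scratch.txt is a made-up file that is never added, included only to show that a file Git has never tracked is left alone:

```shell
#!/bin/bash
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)          # hypothetical scratch directory
cd "$demo"
git init -q .

# Stand-in for commit a123456: f1 and f2.
echo one > f1; echo two > f2
git add f1 f2
git commit -q -m 'f1 and f2'
a=$(git rev-parse HEAD)

# Stand-in for commit cbcdef0: f1 and f3.
git rm -q f2
echo three > f3
git add f3
git commit -q -m 'f1 and f3'
c=$(git rev-parse HEAD)

# A file that is in neither commit and never in the index.
echo mine > scratch.txt

git checkout -q "$a"       # f3 is tracked and not in a: removed; f2 restored
at_a=$(echo *)
git checkout -q "$c"       # f2 is tracked and not in c: removed; f3 restored
at_c=$(echo *)
echo "at a: $at_a"
echo "at c: $at_c"
```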
Note that you can, if you like, run git rm --cached f3
right now, before switching to a123456
. That removes f3
from Git's index. Now f3
is an untracked file. Now you can git checkout
commit a123456
, and Git won't remove f3
, because f3
is not in Git's index. The fact that you have a file f3
lying around in your working tree: well, that's your business. It's not in Git's index, so it's an untracked file: one of yours, that you've left there for whatever purpose. It's not for Git to bother with.
But now, in your example, you've thrown another repository-and-work-tree combination into the mix. You now have your local repository, in which you're running git rm --cached
and git add
and making new commits. That's all fine! But you also have, somewhere—on your own machine, or on some other machine—another Git repository. That Git repository has its own index and its own working tree.
If you are on some commit, and remove some file with git rm --cached
, it's now gone from Git's index, but still in your working tree. You make the new commit, which lacks the file, and all is still fine.
But now, over in the other Git repository, you do something that obtains this new commit. You still have the old commit checked out, and that has some files that are in both Git's index and your working tree here, on this machine in this other repository. Now you tell Git: switch to some other commit and the other commit lacks the file. The file is in Git's index here—this index is part of this repository—so the file is tracked here, so Git removes the file, just as it should.
You will always face this problem in other Git repositories that have commits checked out, if you use this kind of git rm --cached
trick to remove a file from Git's index and make a new commit while keeping the file in your working tree. That's because their index and their Git are never told to keep the file. All they see is "old commit has file, new commit lacks file": that's an instruction to delete the file. It's not gone from the repository, but it is gone from the working tree.
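Here is a reduced version of the two-repository situation, as a sketch rather than a copy of the MWE: the directory names upstream and local and the file names keep.txt and drop.txt are placeholders, and the pull uses a path instead of a configured remote:

```shell
#!/bin/bash
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

top=$(mktemp -d)           # hypothetical scratch directory
cd "$top"

# First repository: commits two files.
git init -q upstream
cd upstream
echo keep > keep.txt; echo drop > drop.txt
git add .
git commit -q -m 'two files'
branch=$(git symbolic-ref --short HEAD)
cd ..

# Second repository: stop tracking drop.txt, keep it in the work-tree.
git clone -q upstream local
cd local
git rm -q --cached drop.txt
git commit -q -m 'stop tracking drop.txt'

# Back in the first repository: obtain and check out the new commit.
cd ../upstream
git pull -q --ff-only ../local "$branch"

# drop.txt is tracked here and absent from the new commit, so Git removed it.
if [ -e drop.txt ]; then echo "upstream: drop.txt present"; else echo "upstream: drop.txt removed"; fi
# It is not gone from the repository: the older commit still holds it.
git show 'HEAD~1:drop.txt'
```

The copy in local survives only because local's own index was told to forget the file before the commit was made.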
.gitignore
The .gitignore
file is misnamed. Git will build a new commit from whatever is in Git's index. That's the proposed next commit. If a file is in Git's index, it is in Git's index, whether or not the name or pattern is in a .gitignore
file. Hence this:
- the remote repository later on adds one of the two files to .gitignore (that's not correct, I know, but it happened in our organization)
isn't immediately harmful. Later, it can be. It's not exactly wrong, but I think it's bad practice.
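To check that a .gitignore entry has no effect on a file that is already in the index, something like the following can be tried in a scratch repository (settings.conf is a placeholder name):

```shell
#!/bin/bash
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)          # hypothetical scratch directory
cd "$demo"
git init -q .

echo v1 > settings.conf
git add settings.conf
git commit -q -m 'track settings.conf'

# List the already-tracked file in .gitignore.
echo settings.conf > .gitignore
git add .gitignore
git commit -q -m 'ignore settings.conf (already tracked)'

# Modify the file and do an en-masse add: the change is still staged.
echo v2 > settings.conf
git add .
staged=$(git diff --cached --name-only)
git commit -q -m 'update settings.conf'
committed=$(git show HEAD:settings.conf)
echo "staged: $staged"
echo "committed content: $committed"
```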
The .gitignore
file has several functions. The most important ones are the ones that affect people doing new work.
When you run git status
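, your Git:
1. prints on branch xyzzy (or whatever your current branch is);
2. compares the current (HEAD) commit to the index, and lists the files that differ;
3. compares the index to your working tree, and lists the files that differ; and
4. lists your untracked files.
The list of file names printed in step 2 is a result of comparing the current commit to the proposed next commit. For each file that is the same, Git says nothing at all. For each file that is different in any way, Git says that this file is staged for commit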
, along with a status-code letter: M
for modified (file exists in both HEAD
commit and index / staging-area), D
for deleted (file exists in HEAD
but not in index); A
for added, and so on.3
The list of names printed in step 3 is a result of comparing the proposed next commit to your working tree. That is, Git diffs the files in the index vs the files in your work-tree. For those that are the same, Git says nothing at all. For those that are different, Git prints much the same as before, but now calls these not staged for commit
. You can run git add
on these to copy the work-tree version of the file into the index.
What's a bit peculiar here is that instead of saying that some files are A
dded, Git gathers up all those file names and then shuffles them off into step 4. These are your untracked files. Git now whines about them, implying that you ought to use git add
to add them.
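This also accounts for the "both deleted and untracked" observation in the question. The sketch below reproduces it with one file in a scratch repository; it uses git status --porcelain for stable output and the two git diff forms that correspond to the two comparisons:

```shell
#!/bin/bash
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)          # hypothetical scratch directory
cd "$demo"
git init -q .

echo data > file_to_remove.txt
git add .
git commit -q -m 'first commit'

git rm -q --cached file_to_remove.txt

# HEAD vs index: the file is in HEAD but not in the index, so "D".
staged=$(git diff --cached --name-status)
# Index vs work-tree, for files that are in the index: nothing to report.
unstaged=$(git diff --name-status)
# The work-tree file is no longer in the index, so it is also untracked.
status=$(git status --porcelain)
printf '%s\n' "$status"

# An en-masse add puts the untracked file back in the index, which then
# matches HEAD again: there is no deletion left to commit.
git add .
after_add=$(git status --porcelain)
echo "after git add .: '$after_add'"
```

A file that matches a .gitignore pattern would not be picked up again by the en-masse add, which is why only that file's deletion ends up recorded.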
In many setups, there will be many untracked files that definitely shouldn't be git add
ed, because they should not be in the next commit. Adding them puts them into the proposed next commit, which means now you have to go take them out (git rm --cached
) so that they don't actually go into the commit when you do make it. We'd like to get git status
to just shut up about this.
So, we list these files in .gitignore
. That means that instead of .gitignore
, perhaps this file should be named .git-do-not-whine-about-these-untracked-files-in-git-status-output
.
But we also have an easy way of git add
-ing all files: we run git add .
or git add *
or git add -A
or whatever. This tells Git to add all files, en masse, in one swell foop. But there are files we don't want to be added: those same files we had git status
shut up about. So the file should be called .git-do-not-whine-about-these-untracked-files-in-git-status-output-and-do-not-auto-add-them-when-I-use-an-en-masse-style-git-add-operation
, or something like that.
The files aren't literally ignored, because any file that's in Git's index will be in the next commit. So .gitignore
is clearly the wrong name. But the right name is way too long and ridiculous to type in: we might as well call this file .gitignore
.
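Both of these effects are easy to confirm on an untracked file. A sketch, assuming a scratch repository with no global ignore rules that already match the placeholder name generated.dat:

```shell
#!/bin/bash
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)          # hypothetical scratch directory
cd "$demo"
git init -q .

echo code > main.c
git add main.c
git commit -q -m 'first commit'

# An untracked file: git status complains about it.
echo noise > generated.dat
before=$(git status --porcelain)

# List it in .gitignore: git status stops mentioning it.
echo generated.dat > .gitignore
after=$(git status --porcelain)

# An en-masse add skips it as well.
git add .
in_index=$(git ls-files | xargs)
echo "before: $before"
echo "after:  $after"
echo "index:  $in_index"
```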
There's one other thing that .gitignore
does, though, and it's where listing a tracked file—specifically, any file that will become a tracked file upon checking out some commit—can be dangerous. What the .gitignore
listing does is give Git permission, in some circumstances,4 to overwrite or remove the file, clobbering data that you might not have saved anywhere. This is why I don't like the situation where you have a file that's tracked, but also matches a .gitignore
pattern: it sets you up to lose data later by accident.
(This means the full correct name might really be .git-do-not-whine-about-these-untracked-files-in-git-status-output-and-do-not-auto-add-them-when-I-use-an-en-masse-style-git-add-operation-but-do-feel-free-to-clobber-them
. That's ... even more ridiculous.)
3When you're in the middle of a conflicted merge, all of this changes. The git status
documentation is being improved for the next release of Git, to try to help make more sense of the status code letters. I won't go into further detail here though.
4For a very long and wide-ranging discussion of this problem, see, e.g., this thread from the Git mailing list archives. The usual egregious case (wiping out a local configuration) occurs when you git merge
a commit that has a configuration file in it, that is not in your current commit, but is present and is listed in a .gitignore
file.