I have read plenty of other questions pertaining to the auto-crlf setting in Git and I am not asking which auto-crlf setting I should be using, or how to normalize my line endings in a project. My question has to do with understanding the auto-crlf setting itself.
Quick background:
I started a project on Linux but am now starting to work on it from a Windows system as well. In the repository and on Linux my files use LF line endings. However, despite having auto-crlf set to "true" on my Windows system (since before I cloned the project), Git considers certain files "modified" if the only difference is line endings.
It only considers files "modified" if I open up a file, make a change, save and then undo all changes (Ctrl+Z or a manual undo) and save again. Every diff utility I use tells me line endings are the only difference (LF in the repo and CRLF on local).
Until recently I always thought this setting affected which files are considered modified in addition to the conversion. But after a second read of the description in conjunction with the behavior I'm experiencing, I'm starting to think it only converts line endings upon commit/checkout and has nothing to do with determining which files have been modified.
Here is where I'm reading the description of this setting.
Is this setting supposed to also affect which files are considered modified in addition to handling conversion?
EDIT:
Just wanted to add to my particular "Background" situation for anyone who sees similar behavior. After reading through torek's answer I was able to determine that my IDE was automatically "adding" files to Git upon save. This was causing the "mtime" to change, which was the root of the seemingly odd behavior.
The true answer is complicated and gets into the dual nature of Git's index, which is as both a "staging area" and a "cache".
It's also worth thinking about Git's smudge filters and clean filters here. In essence, all LF/CRLF conversions are a form of smudging and cleaning.
Whenever you are working within a Git repository, there are three things you have to keep in mind:
The current commit, also known as HEAD. (The file .git/HEAD stores part or all of this information: it usually contains the name of a branch, and then the branch name itself contains the rest of the information, namely the current commit hash ID. In "detached HEAD" mode, .git/HEAD itself contains the hash ID.)
Since all commits are, by definition, read-only, the hash ID suffices to describe it completely. Once Git has resolved HEAD to a hash ID, Git can get at the stored files.
The index. While the best casual description of the index is "what will go into the next commit", the actual form of the index is rather complicated, so we'll hold off on the details for a moment. This is also where the index starts to play its role of "cache".
The work-tree. As the name implies, this is where you do your real work. It has all the files in their normal format, so that all your programs and tools work with them.
"Normal format" is the key phrase here: the normal format on a Unix-ish system is that lines are newline-terminated, while the normal format for many Windows files is that lines are CRLF ('\r\n') terminated. (We'll just pretend here that all Windows files are like this, though in fact only most files are, with binary files being the first obvious sticking point.)
If you think about smudge and clean filters, the file in the work-tree is in the "smudged" form. That is, if you have something like Git-LFS in operation, Git-LFS is allowed to modify the work-tree version of the file so that it differs in some major way from the committed version. (In particular, Git-LFS tricks Git into saving just a pointer to the actual file, and then Git-LFS retrieves the real—and presumably too-large-for-GitHub or whatever—file from somewhere else, so what's in your work-tree here isn't actually checked in at all!)
Note that the index sits "between" the read-only HEAD commit and the work-tree. This means that files can be copied from HEAD to index, or from index to work-tree, or from work-tree to index. (They can't get copied from index to HEAD except by making a new commit, which then becomes the current commit, because all commits are read-only.)
This is pretty obvious, but is worth stating. If files inside the repository have newline termination format, they don't match the normal format for Windows. Something has to translate back and forth.
The translation gets done, as the Pro Git book notes, during the copying of files into and out of the index. But there are three such possible places: if we copy from HEAD to index, that puts a (copy of a) file into the index; if we copy from index to work-tree, that makes a copy in that direction; and if we copy from work-tree to index, that makes a copy in the other direction. Now the actual format of the index, and which copies we care about, starts to matter.
The index format is complicated. To see the actual index right now in human-readable form, run git ls-files --stage --debug, which dumps out a lot of information. (Though even with --debug this omits some details.) The most crucial and interesting parts are what you see even without --debug though, e.g.:
100644 4646ce575251b07053f20285be99422d6576603e 0 xdiff/xutils.h
The first value is the "mode" of a file (always 100644 or 100755 for a regular file), the second is a Git hash ID, the third is a stage number (normally zero), and the last is the name of the file.
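To make the four fields concrete, here is a minimal Python sketch that splits one such line into its parts. The names StageEntry and parse_stage_line are made up for illustration and are not part of Git; real git ls-files --stage output puts a tab before the path, and the sketch also tolerates a space, as in the line quoted above.

```python
from typing import NamedTuple


class StageEntry(NamedTuple):
    """One line of `git ls-files --stage` output (hypothetical helper type)."""
    mode: str       # e.g. "100644" or "100755" for a regular file
    blob_hash: str  # hash ID of the stored (cleaned) blob
    stage: int      # normally 0; nonzero during a conflicted merge
    path: str       # file name relative to the top of the work-tree


def parse_stage_line(line: str) -> StageEntry:
    # Split on whitespace at most three times, so that a path
    # containing spaces stays in one piece.
    mode, blob_hash, stage, path = line.rstrip("\n").split(None, 3)
    return StageEntry(mode, blob_hash, int(stage), path)


entry = parse_stage_line(
    "100644 4646ce575251b07053f20285be99422d6576603e 0 xdiff/xutils.h"
)
print(entry.mode, entry.stage, entry.path)
```

Note that nothing in the entry holds file contents: only the hash ID of data that is already stored in the repository.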
This hash ID is, at least initially, the same as the hash ID in the original commit. Since that committed file is read-only, that hash ID represents the file in its permanent-storage form, not its work-tree form.
What this in turn means is that the file is stored in the index in its "cleaned" form (with CRLF turned into LF-only, or Git-LFS replacing the entire file with a pointer). In fact, the cleaned data is already pre-written into the Git repository, and the index stores only its blob hash! This is one of the tricks to make Git go fast: the index entry has just the hash ID (and path name, and mode, and stage number, and all those --debug output things).
What this also means is that any smudging (to turn LF into CRLF, or retrieve actual files from Git-LFS) happens during the copy from index to work-tree. Any cleaning, to turn CRLF into LF-only or store a new file outside Git and update the pointer, happens during the copy from work-tree to index.
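As a rough illustration of the two directions, the following Python sketch treats line-ending conversion as a pair of functions. This is a deliberate simplification, not Git's actual implementation: the names clean and smudge are borrowed from the filter terminology above, and real Git applies extra rules (text detection, safety checks) that are ignored here.

```python
def clean(worktree_bytes: bytes) -> bytes:
    """Work-tree -> index direction: CRLF becomes LF-only."""
    return worktree_bytes.replace(b"\r\n", b"\n")


def smudge(index_bytes: bytes) -> bytes:
    """Index -> work-tree direction: LF-only becomes CRLF."""
    # Normalize first so that an existing CRLF does not become CR CR LF.
    return clean(index_bytes).replace(b"\n", b"\r\n")


stored = b"line one\nline two\n"   # what the repository holds
checked_out = smudge(stored)       # what lands in a Windows work-tree
round_trip = clean(checked_out)    # what `git add` would store again

print(checked_out)
print(round_trip == stored)
```

The round trip is the important property: smudging and then cleaning gives back exactly the stored bytes, so a file that differs only in line endings produces the same stored data.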
Finally, what else this means is that Git can't easily tell, just from the work-tree file, whether the index version of the file is up to date or not. Is the work-tree version modified? The only way to be sure is to do a new complete cleaning, and see if you get the same hash ID for the resulting data; or do a new complete extraction, and see if you get the same work-tree file. But this process is slow: it can actually take tens of milliseconds, even if you don't have to go through Git-LFS and retrieve or store a copy of the real file somewhere else. Multiply by many files, and it's just too slow. (In a really big repository, git checkout of a commit can take literally seconds, and this would mean that git status and other such commands would be just as slow.)
Git's answer to this performance dilemma is simply to avoid it entirely if possible. Don't actually build a new repository entity and hash; don't take the existing repository object and re-expand it. What Git does is store information about the work-tree file in the index:
ctime: 1500043102:605208000
mtime: 1500043102:605208000
These two time stamps are the "inode change time" (ctime) and the "modification time" (mtime), which Git copies from the stat or lstat system call result on the work-tree file. As long as the underlying system updates the work-tree time stamps whenever the work-tree file changes, Git can just compare the current time stamps on the work-tree file with the saved time stamps in the index. (Git also saves the work-tree file size, in the same way.) If the time stamps match, Git assumes the file is "clean". If the time stamps on the work-tree file are newer than those in the cache, the file may be dirty, and Git must do the extra work to find out for sure. (In practice, the time stamps on the index file itself also come into play here, since one second is a very long time in compute terms. See this link for details.)
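The cheap check can be sketched in Python as follows. This is only an approximation of the idea, under stated assumptions: CachedStat, snapshot, and maybe_dirty are hypothetical names, the real index records more stat fields than these three, and on Windows Python's st_ctime is the creation time rather than the inode change time.

```python
import os
import tempfile
from typing import NamedTuple


class CachedStat(NamedTuple):
    """Subset of the stat data an index entry remembers (illustrative only)."""
    mtime_ns: int
    ctime_ns: int
    size: int


def snapshot(path: str) -> CachedStat:
    st = os.lstat(path)
    return CachedStat(st.st_mtime_ns, st.st_ctime_ns, st.st_size)


def maybe_dirty(path: str, cached: CachedStat) -> bool:
    # Cheap test: no file contents are read, only metadata is compared.
    return snapshot(path) != cached


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "wb") as f:
        f.write(b"hello\r\n")

    cached = snapshot(path)                # what "the index" remembers
    unchanged = maybe_dirty(path, cached)  # nothing touched the file

    # Simulate "edit, undo, save again": same bytes, later mtime.
    st = os.lstat(path)
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
    touched = maybe_dirty(path, cached)

print(unchanged, touched)
```

The second check reports "maybe dirty" even though not a single byte of the file changed, which is what forces the slower content comparison.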
If you change core.autocrlf or the text-ness of a file or the smudge and/or clean filters for some particular file(s), this affects how the file would be copied from the index to the work-tree, or from the work-tree to the index. But it has no effect on the cache data stored in the index file. This means Git will think, possibly incorrectly, that the work-tree file is "clean", when it isn't.
It only considers files "modified" if I open up a file, make a change, save and then undo all changes (Ctrl+Z or a manual undo) and save again.
Writing to the file changes the time stamps on the work-tree file, so that Git will do more work when comparing the work-tree file to the index version.
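Once the time stamps no longer match, the "more work" is essentially: clean the work-tree data, hash it, and compare against the hash ID in the index. Here is a small Python sketch of that slow path. The helper names blob_hash and is_modified and the convert_crlf flag are hypothetical, and the sketch assumes a SHA-1 repository, where a blob's hash ID is the SHA-1 of the header "blob <size>\0" followed by the data.

```python
import hashlib


def blob_hash(data: bytes) -> str:
    """Hash ID Git gives a blob with this content (SHA-1 repositories)."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def is_modified(worktree_bytes: bytes, index_hash: str, convert_crlf: bool) -> bool:
    # Slow path: clean the work-tree data, hash it, compare with the index.
    cleaned = worktree_bytes.replace(b"\r\n", b"\n") if convert_crlf else worktree_bytes
    return blob_hash(cleaned) != index_hash


index_hash = blob_hash(b"hello\n")  # the LF-only version stored in the repository
worktree = b"hello\r\n"             # the CRLF version in the work-tree

with_conversion = is_modified(worktree, index_hash, convert_crlf=True)
without_conversion = is_modified(worktree, index_hash, convert_crlf=False)
print(with_conversion, without_conversion)
```

With conversion applied, the CRLF work-tree file cleans to the same blob and is not modified; without it, the same bytes hash differently and the file shows up as changed.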
... I'm starting to think [Git] only converts line endings upon commit/checkout ...
That's mostly right. The conversion from CRLF to LF-only will happen during git add, which copies from work-tree to index, or anything that invokes git add or its underlying code (including the adding for git commit -a or git commit [--only | --include] -- <paths>), provided the file is classified as text and you have enabled CRLF-to-LF conversion for it.
Meanwhile, the conversion from LF-only to CRLF happens during git checkout, when it copies from index to work-tree, or a few other more obscure related cases (git read-tree -u for instance), provided the file is classified as text and you have enabled LF-to-CRLF conversion for it.
Note that whether and when a file is classified as text depends on many settings. In general, whatever is in .gitattributes overrides core.* settings, but if nothing is set in .gitattributes, the core.* settings will apply.
Some of the other tools, such as git show and git cat-file -p, are now able to do text conversions via options (in the old days git show <commit>:<path> showed only the cleaned data, never the smudged form). And for quite a while now, git merge has supported the concept of "renormalization": doing a virtual check-out plus check-in before diffing and combining-diffs-of the base commit and the two commits-to-merge.