
git clean filter shows differences in the result of git diff


I set up a clean filter to apply autoformatting with uncrustify. The corresponding smudge filter does nothing; it just calls cat.

[filter "autoformat"]
        clean = uncrustify -c ~/tmp/autoformat/uncrustify.cfg --replace
        smudge = cat

My problem is that when I check out, e.g., master, git says my working tree differs from the commit, and the difference shown is exactly what the clean filter does.

It seems that the clean filter is applied before diff. Is that correct? Is it possible to disable this? Would it be a good idea?

I would like a solution where autoformat is applied to the staging area, that is, only to the hunks that were staged. Isn't a clean filter an appropriate solution?


With a master commit that does not conform to the coding standard in my uncrustify config, git checkout -f HEAD^ followed by git checkout -f master shows a diff. This is both confusing and cumbersome (git refuses to check out anything else, to prevent losing changes).


Solution

  • It seems that the clean filter is applied before diff. Is that correct?

    Yes. In at least some cases it must be. Consider, for instance, what happens if the smudge filter consists of, say, "double every character" and the clean filter consists of "remove the doubling"—or, if that seems too peculiar, if the smudge filter consists of "translate into some alternate character set" and the clean filter translates back.
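
    As a minimal sketch of the "double every character" example, the two shell functions below stand in for a hypothetical smudge/clean pair (the names smudge, clean, and the sample text are illustrative assumptions, not anything Git provides). The committed form and the work-tree form differ byte for byte, so a meaningful comparison has to push one side through a filter first.

```shell
#!/bin/sh
# Hypothetical filter pair, not part of Git itself:
#   smudge doubles every character, clean removes the doubling.
smudge() { sed 's/./&&/g'; }
clean()  { sed 's/\(.\)\1/\1/g'; }

committed='int x;'
# What would land in the work-tree after checkout.
worktree=$(printf '%s\n' "$committed" | smudge)
# What would go back into the index on git add.
roundtrip=$(printf '%s\n' "$worktree" | clean)

printf 'committed: %s\n' "$committed"
printf 'work-tree: %s\n' "$worktree"
printf 'cleaned:   %s\n' "$roundtrip"
```

    Comparing the committed text directly against the work-tree text would flag every line as changed, even though nothing was edited.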

    A git diff to compare the work-tree against an actual commit must either run the smudge filter on the commit's content, or run the clean filter on the work-tree's content. Or it might even run both, with the output going to temporary files. (I'm pretty sure I tested this once, long ago, and found that the approach Git used was to run the clean filter, rather than the smudge filter. But see Cyker's comment, which suggests it runs both filters and then diffs smudged results.)

  • Is it possible to disable this? Would it be a good idea?

    See above—at best you might have a "run only the smudge filter" option (but there is none).

    Note that what's in the index is already clean, by definition. Cleaning happens on the transition from work-tree to index; smudging happens on the transition from index to work-tree.

    Existing commits are strictly read-only and extracting a commit into the index makes no changes. Hence, while the index contents are clean by definition, if the clean filter itself has changed, they may not match what you would get by re-running the filter.
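
    The situation in the question can be reproduced in a throwaway repository. This is a sketch under stated assumptions: the filter name upper and the tr command are stand-ins for the uncrustify setup, and the touch call is only there to make Git re-examine the file rather than trust its cached stat data. The commit is made before the filter exists, so the stored blob is not what the filter would now produce.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Commit a file before any filter is configured: the blob is "unclean".
printf 'hello\n' > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'unfiltered commit'

# Introduce a clean filter afterwards (upper-casing stands in for uncrustify).
git config filter.upper.clean 'tr a-z A-Z'
git config filter.upper.smudge cat
mkdir -p .git/info
echo '*.txt filter=upper' > .git/info/attributes

# Change the timestamp so Git re-reads, and therefore re-cleans, the file.
touch -t 200001010000 file.txt

status=$(git status --porcelain)
diff_out=$(git diff)
printf '%s\n' "$status"
printf '%s\n' "$diff_out"
```

    The work-tree file was never edited, yet git status reports it as modified and git diff shows exactly the change the clean filter would make.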

  • I would like a solution where autoformat is applied to the staging area, that is only the hunks that were staged. Isn't a clean filter an appropriate solution?

    This does not work the way you are thinking.

    Running git add does not apply diff hunks to the index copy: running git add copies the entire work-tree file into the index. The whole thing gets cleaned.

    Running git add -p also does not actually apply diff hunks to the index copy, because it literally can't. Instead, git add -p extracts the index copy to a temporary file, applies a diff hunk to the temporary file, and then copies the entire temporary file (with applied hunk) into the index, running that through the clean filter. Once again the whole thing gets cleaned—it's just that "the whole thing" is a temporary file built by patching the smudged index copy.
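
    A small experiment shows the plain git add case: the whole file goes through the clean filter on its way into the index, while the work-tree copy is left alone. As before, the filter name upper and the tr command are illustrative assumptions standing in for the real formatter.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical clean filter: upper-case everything; smudge is a no-op.
git config filter.upper.clean 'tr a-z A-Z'
git config filter.upper.smudge cat
mkdir -p .git/info
echo '*.txt filter=upper' > .git/info/attributes

printf 'first line\nsecond line\n' > file.txt
git add file.txt

worktree=$(cat file.txt)
# ":file.txt" names the index copy; cat-file prints the raw blob, unsmudged.
indexed=$(git cat-file -p :file.txt)

printf 'work-tree:\n%s\n' "$worktree"
printf 'index:\n%s\n' "$indexed"
```

    Every line of the index copy is cleaned, not just some selected part, and the work-tree file keeps its original content.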

    In other words, the index copy of each file is an entity unto itself, independent of the HEAD commit copy and the work-tree copy. Git starts out, at git checkout time, by just copying the commit copy of the file directly into the index (no changes, no filters), then copies the index copy of the file into the work-tree (smudge filter). At git add time, Git runs the clean filter on the work-tree file (or the patched result) and stuffs that into the index. [1]


    [1] Technically, the index holds not the files themselves, but rather their content hashes. Adding a file consists of writing the file into the repository! The hash ID of the resulting blob object goes into the index. The index entry keeps the blob from being garbage-collected, if the index is the only place the blob is used (if the blob matches some committed blob then it's safe from the Grim Collector).
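
    The footnote can be checked with standard plumbing commands. In this sketch (throwaway repository, no filters configured, file name chosen arbitrarily), git ls-files -s prints the index entry and git hash-object computes the blob ID for the same content.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

printf 'hello\n' > file.txt
git add file.txt

# Index entry fields: mode, blob hash, stage number, path.
entry=$(git ls-files -s file.txt)
index_hash=$(printf '%s\n' "$entry" | awk '{print $2}')

# Blob ID Git computes for the same work-tree content.
file_hash=$(git hash-object file.txt)
# The object already exists in the repository, although nothing is committed.
obj_type=$(git cat-file -t "$index_hash")

printf 'index entry: %s\n' "$entry"
printf 'hash-object: %s\n' "$file_hash"
printf 'object type: %s\n' "$obj_type"
```

    The two hashes match, and the object is a blob that exists before any commit has been made.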