Search code examples
gitgithubgit-diffgit-statusgit-mv

Trying to understand `git diff` and `git mv` rename detection mechanism


This is a followup to another question I asked before.

Before being edited, the initially created file something gets renamed to somethingelse which can be observed here:

git mv something somethingelse

The file somethingelse then gets renamed back to something before the second vim edit:

git mv somethingelse something

Basically in the following portion of the code:

# If you add something to the first line, the rename will not be detected by Git
# However, if you instead create 2 newlines and fill line 3 with new code,
# the rename gets detected for whatever reason
printf "\nCOMMAND: vim something\n\n"
vim something

If at this point I add abc to the code, we would end up with:

First line of code. abc

Which I think is an addition of 4 bytes on line 1, which in turn will end up in this:

On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        new file:   something
        deleted:    somethingelse

Then, if we add a newline and type in abc into the third line (which should also be 4 bytes, correct me if wrong):

First line of code.

abc

Suddenly, Git will detect the rename (edit inclusive):

On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        renamed:    somethingelse -> something

One good answer/comment by @torek given here explains this to a certain extent, taking the git diff rename detection treshold of git status into consideration.

Shouldn't Git behave identically since we added 4 bytes in both cases, but in a different manner or does the newline have something to do with this?


Solution

  • Git's "similarity index" computation is not, as far as I know, documented anywhere other than in the source, starting with diffcore-delta.c.

    To compute the similarity index for two files S (source) and D (destination), Git:

    • reads both files
    • computes a hash table of all of the chunks of file S
    • computes a second hash table of all of the chunks of file D

    The entries in these two hash tables are simply a count of occurrences of instances of that hash value (plus, as noted below, the length of the chunk).

    The hash value for a file chunk is computed by:

    • start at the current file offset (initially zero)
    • read 64 bytes or until '\n' character, whichever occurs first
    • if the file is claimed to be text and there is a '\r' before the '\n', discard the '\r'
    • hash the resulting string-of-up-to-64 bytes using the algorithm shown in the linked file

    Now that there are hash tables for both S and D, each possible hash hi appears nS times in S and nD in D (either may be zero, though the code skips right over both-zero hash values). If the number of occurrences in D is less than or the same as the number of occurrences in S—i.e., nD ≤ nS—then D "copies from S" nD times. If the number of occurrences in D exceeds the number in S (including when the number in S is zero), then D has a "literal add" of nD - nS occurrences of the hashed chunk, and D also copies all nS original occurrences as well.

    Each hashed chunk retains its number-of-input-bytes, and these multiply the number of copies or number of additions of "chunks" to get the number of bytes copied or added. (Deletions, where D lacks items that exist in S, have only indirect effect here: the byte copy and add counts get smaller, but Git does not specifically count the deletions themselves.)

    These two values (src_copied and literal_added) computed in diffcore_count_changes are handed over to function estimate_similarity in diffcore-rename.c. It completely ignores the literal_added count (this count is used in deciding how to build packfile deltas, but not in terms of rename scoring). Instead, only the src_copied number matters:

    score = (int)(src_copied * MAX_SCORE / max_size);
    

    where max_size is the size in bytes of larger of the two input files S and D.

    Note that there is an earlier computation:

    max_size = ((src->size > dst->size) ? src->size : dst->size);
    base_size = ((src->size < dst->size) ? src->size : dst->size);
    delta_size = max_size - base_size;
    

    and if the two files have changed size "too much":

    if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
            return 0;
    

    we never even get into the diffcore-delta.c code to hash them. The minimum_score here is the argument to -M or --find-renames, converted to a scaled number. MAX_SCORE is 60000.0 (type double), so the default minimum_score, when you use the default -M50%, is 30000 (half of 60000). Except for the case of CR-before-LF eating, though, this particular shortcut should not affect the outcome of the more expensive similarity computation.

    [Edit: this is now obsolete:] git status always uses the default. There is no knob to change the threshold (nor the number of files allowed in the rename-finding queue). If there were the code would go here, setting the rename_score field of the diff options. Until Git version 2.18.0, there was no way to control this for git status. In Git 2.18.0 and later, git status has the same --find-renames option as git diff. The status.renames option in the Git configuration enables any default detection, and if unset, git status obeys the diff.renames setting; see the git config documentation and the git status documentation.