
Git log (--follow) not working to show history beyond renames


I am trying to show the full history of a file in my Git repository with git log. The problem is that the parent folder of this file was renamed at some point in the history, and I would like to see the full history across that rename.

The git-log documentation says that the --follow and -M options make git log follow renames.

I tried different combinations of git log arguments, like

git log -M --oneline --all -- --follow newpath/my-file.php

git log -M --oneline --all -- newpath/my-file.php and even

git rev-list --all -- newpath/my-file.php --objects --in-commit-order | git log --no-walk --oneline --stdin

But whatever I try, the history always ends at the commit where the parent folder of the file was renamed.

I can already confirm that:

  • only the folder was renamed in the rename commit; the contents of the file are 100% unchanged, so git should simply discover that the file at the old path and the file at the new path are identical and just renamed.

  • git show --name-status for the rename commit shows R100 oldpath/my-file.php newpath/my-file.php (which confirms that the contents of the file are 100% identical)

  • The "old half" and the "new half" of the history seem to be correct; both include the rename commit

  • When I run git log -M --oneline --all -- --follow newpath/my-file.php the oldest commit is 0979744 renamed: oldpath/ -> newpath/

  • When I run git log -M --oneline --all -- --follow oldpath/my-file.php the latest commit is 0979744 renamed: oldpath/ -> newpath/

So everything suggests that Git successfully understands that the file at the new path and the file at the old path are the same file, renamed. Can anybody tell me why the history still breaks at the rename commit even when I use the -M and --follow options?


Solution

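  • As noted in comments, the --follow option must precede the stand-alone -- that indicates the end of the options list. Everything after -- is treated as a path, so in git log -M --oneline --all -- --follow newpath/my-file.php the string --follow is read as a second path name rather than as an option. Writing git log --follow --oneline -- newpath/my-file.php instead turns rename following on.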
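One way to see why the position matters is to split the argument list the way a stand-alone -- does. The Python sketch below is a simplified illustration, not Git's actual option parser; split_at_double_dash is a hypothetical helper introduced only for this example.

```python
def split_at_double_dash(argv):
    """Split arguments at the first stand-alone "--".

    Everything before it may be an option or a revision; everything after
    it is taken literally as a path, even if it starts with a dash.
    """
    if "--" in argv:
        i = argv.index("--")
        return argv[:i], argv[i + 1:]
    return argv, []


# The command from the question: --follow lands on the path side.
options, paths = split_at_double_dash(
    ["-M", "--oneline", "--all", "--", "--follow", "newpath/my-file.php"]
)
print(options)  # ['-M', '--oneline', '--all']
print(paths)    # ['--follow', 'newpath/my-file.php']

# The corrected order: --follow is seen as an option.
options, paths = split_at_double_dash(
    ["--follow", "--oneline", "--", "newpath/my-file.php"]
)
print(options)  # ['--follow', '--oneline']
print(paths)    # ['newpath/my-file.php']
```

In the first call the rename-following switch never reaches the option side, which matches the behavior described in the question.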

    Even if following renames seems to work now, when I add --grep="rename" --invert-grep to remove the "rename" commit, I get 0 results

    That makes sense (but is a bug of sorts) [1], because of the way --follow works. The issue here is that Git doesn't have any kind of file history at all. All that Git has is the set of commits that are in the repository. The commits are the history:

    • Each commit is numbered by its big ugly hash ID, which is unique to that one particular commit. No other commit, in any Git repository [2], has that hash ID.

    • Each commit has a full snapshot of every file.

    • Each commit also stores the hash ID of a previous commit—or, for a merge commit, two or more previous commits.

    So these numbers string commits together, backwards:

    ... <-F <-G <-H
    

    The uppercase letters here stand in for the actual commit hash IDs, by which Git finds the commits. Each commit has a "backwards-pointing arrow" coming out of it—the stored hash ID of the previous commit—so that if we could just remember the hash ID of the last commit in the chain, we could have Git work backwards through the chain.
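To make the idea of hash IDs stringing commits together concrete, here is a small Python sketch. It assumes a SHA-1 repository, where an object's ID is the SHA-1 of a short header followed by the object's bytes; the author line and commit messages are made up, and object_id and commit_id are hypothetical helper names. Because the parent's ID is part of the child's bytes, the same snapshot with a different parent gets a different ID.

```python
import hashlib


def object_id(kind, body):
    """ID of a Git object: SHA-1 over "<kind> <size>\\0" plus the body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree, parent, message):
    """Build a commit's bytes (made-up author and dates) and hash them."""
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")  # the backwards-pointing arrow
    who = "A U Thor <author@example.com> 0 +0000"
    lines += [f"author {who}", f"committer {who}", "", message, ""]
    return object_id("commit", "\n".join(lines).encode())


empty_tree = object_id("tree", b"")  # the well-known empty tree ID
f = commit_id(empty_tree, None, "F")
g = commit_id(empty_tree, f, "G")
g_other = commit_id(empty_tree, "0" * 40, "G")  # same snapshot, other parent

print(empty_tree)    # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
print(g != g_other)  # True
```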

    A branch name just tells Git which commit is the last commit in that branch:

                 I--J   <-- feature1
                /
    ...--F--G--H
                \
                 K--L   <-- feature2
    

    Here, commit J is the last commit on one of the feature branches and commit L is the last commit on the other. Note that commits up through H are on both branches (and quite likely on the main or master branch as well).

    The git log command simply works through the commits, one at a time, starting from whatever "last commit" you choose. The default "last commit" is the one at the tip of whatever branch you have checked out right now. This process works backwards: Git starts with the last commit and works backwards, one commit at a time.
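The backwards walk can be sketched in a few lines of Python. This is a toy model of the diagram above, not Git's implementation: the single letters stand in for hash IDs, F is treated as the first commit (the diagram's "..." is left out), and log is a hypothetical function name.

```python
# Toy commits: each stand-in ID maps to the IDs of its parents.
parents = {
    "F": [], "G": ["F"], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
}
# A branch name only records the last commit on that branch.
branches = {"feature1": "J", "feature2": "L"}


def log(tip):
    """Walk backwards from tip along the parent links, newest first."""
    seen, stack, order = set(), [tip], []
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        order.append(commit)
        stack.extend(parents[commit])
    return order


print(log(branches["feature1"]))  # ['J', 'I', 'H', 'G', 'F']
print(log(branches["feature2"]))  # ['L', 'K', 'H', 'G', 'F']
```

Both walks end up sharing H, G and F, which is what "commits up through H are on both branches" means.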

    The -M option to git diff, which is short for --find-renames, enables rename detection in git diff. The --follow option to git log does the same for git log, but also takes the name of one single file to look for. (Giving the -M option to git log makes it use the rename detector at each diff, but since it's not looking for one specific file, that just affects the -p or --name-status style of output. With --follow, git log is looking for that one specific file, as we'll see in a moment.)

    The rename detector works this way:

    • You give Git two commits, before and after or old and new or, say, F and G. (You can put the new commit on the left side, and the old one on the right, but git log itself always puts older on left, newer on right.)

    • You have Git compare the snapshots in these two commits.

    • Some files in those commits are 100% identical: they have the same name and the same content. Git's internal storage system has de-duplicated these files and this makes it very easy for git diff or git log to decide that these files are the same, so it can skip right over them if appropriate.

    • Other files have the same names but different contents. Git assumes, by default, that if the two files have the same name—such as path/to/file.ext: note that the embedded slashes are just part of the file's name—they represent the "same file", even if the contents have changed. So that file is modified, from the old / left-side commit to the new / right-side commit. If you ask for --name-status, you'll get M, modified, as the status for that file name.

    • Sometimes, the left-side commit has a file named, say, fileL, and the right-side commit doesn't have that file at all. That file is deleted, apparently, in the change from old (left) to new (right). With --name-status you would get D for the status.

    • Sometimes, the right-side commit has a file named, say, fileR, and the left-side commit just doesn't. That file is newly added, apparently, and with --name-status you would get A for the status.

    • But what if fileL on the left and fileR on the right should be considered to be "the same file"? That is, what if we renamed fileL to fileR? This is where Git's rename detector comes in. Given a deleted/added pair like this, maybe the content of fileL is sufficiently close to, or exactly the same as, the content of fileR. If:

      • you have turned on the rename detector, which will actually do this content-checking, and
      • the content-checking says "exactly the same" (very fast to know due to the de-duplication) or "sufficiently similar" (much slower, but enabled by the same rename-detector switch),

      then—and only then—Git will declare that fileL was renamed to become fileR. The --name-status output will include R, the similarity index value, and the two file names, rather than the single file name that matches in both left and right side commits.
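The classification above can be sketched as a small Python function. This is a toy model under stated assumptions: snapshots are plain dicts of path to content, difflib's SequenceMatcher stands in for Git's own similarity measure (which works differently), and name_status is a hypothetical name. The 50% threshold mirrors the documented default for -M.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score between two file contents, 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def name_status(old, new, find_renames=False, threshold=0.5):
    """Compare two snapshots (dicts of path -> content), old on the left."""
    result = []
    for path in sorted(old.keys() & new.keys()):
        if old[path] != new[path]:
            result.append(("M", path))  # same name, different content
    deleted = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    if find_renames:
        for src in list(deleted):
            if not added:
                break
            # Pair the deleted file with the most similar added file.
            dst = max(added, key=lambda p: similarity(old[src], new[p]))
            score = similarity(old[src], new[dst])
            if score >= threshold:
                result.append(("R%03d" % int(score * 100), src, dst))
                deleted.remove(src)
                added.remove(dst)
    result += [("D", p) for p in deleted] + [("A", p) for p in added]
    return result


old = {"oldpath/my-file.php": "<?php echo 1;\n", "README": "hello\n"}
new = {"newpath/my-file.php": "<?php echo 1;\n", "README": "hello\n"}

print(name_status(old, new))
# [('D', 'oldpath/my-file.php'), ('A', 'newpath/my-file.php')]
print(name_status(old, new, find_renames=True))
# [('R100', 'oldpath/my-file.php', 'newpath/my-file.php')]
```

With the detector off, the same pair of snapshots is reported as an unrelated delete and add; only the switch turns it into R100.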

    Now that you know how the rename detector works (and that it has to be switched on), you can see how --follow works. Remember that with git log, you can give it a file name, and tell it not to show commits that don't modify that particular file [3]. The result is that you only see commits that do modify that file: a subset of the set of all commits that git log visits. So let's say you run git log --follow -- newpath/my-file.php:

    • git log walks through history, one commit at a time, backwards, as usual.

    • At each commit, it compares this commit (newer, on right) against its parent (older, on left). Without --follow it would still do this, but just look to see if the file you named was changed (M status, from git diff --name-status) or added or deleted (A, D) [4]. But with --follow, it also looks for an R status.

    • If the file was changed—has M or A or D status—git log prints out this commit, but if not, it just suppresses the printout. With --follow, we add the R status and, if that happens, the two file names. If the status is R, well, git log has been looking for newpath/my-file.php before. But now it knows that, as of the parent commit, the file was called oldpath/my-file.php. (Note, again, that there is no folder here. The file's name is the whole string, including all the slashes.)

    So, with --follow—which turns on the rename detector—git log can get a renamed status and therefore see that the file gets renamed. It's also looking for one specific file name, in this case, newpath/my-file.php. If it detects a rename, git log not only prints the commit, but also changes the one name it is looking for. Now, instead of newpath/my-file.php, from the parent commit on backwards, it is looking for oldpath/my-file.php.
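The name-switching step can be modeled on a linear toy history. The Python sketch below is an illustration of the description above, not Git's code: commit subjects and file contents are invented (only the rename subject comes from the question), renames are detected by exact content match only, and log_follow is a hypothetical function name.

```python
OLD = "oldpath/my-file.php"
NEW = "newpath/my-file.php"

# Linear toy history, oldest first: (subject, snapshot of path -> content).
history = [
    ("add file", {OLD: "v1"}),
    ("edit file", {OLD: "v2"}),
    ("renamed: oldpath/ -> newpath/", {NEW: "v2"}),
    ("edit again", {NEW: "v3"}),
    ("unrelated change", {NEW: "v3", "other.txt": "x"}),
]


def log_follow(history, path, follow=True):
    """Return the subjects of commits that touch path, newest first."""
    shown = []
    for i in range(len(history) - 1, -1, -1):
        subject, new = history[i]
        old = history[i - 1][1] if i > 0 else {}  # parent's snapshot
        if new.get(path) == old.get(path):
            continue  # path untouched by this commit: suppress it
        shown.append(subject)
        if follow and path not in old:
            # The file appears here. If an identical file vanished from the
            # parent under another name, treat it as a rename and switch.
            for old_path, content in old.items():
                if old_path not in new and content == new[path]:
                    path = old_path
                    break
    return shown


print(log_follow(history, NEW, follow=False))
# ['edit again', 'renamed: oldpath/ -> newpath/']
print(log_follow(history, NEW, follow=True))
# ['edit again', 'renamed: oldpath/ -> newpath/', 'edit file', 'add file']
```

Without the switch, the walk keeps looking for the new name in commits where it never existed, so the output stops at the rename commit, as in the question.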


    [1] The --follow code itself is ... not very good; the whole implementation needs to be reworked, which would probably fix this better than the simpler hack I'm thinking of.

    [2] Technically, some other Git repository could have a different commit that re-uses that hash ID, as long as you never introduce the two commits to each other. In practice, you won't find one, though.

    [3] The --follow option can only follow one file name. Without --follow, you can give git log more than one name, or the name of a "directory" even though Git doesn't really store directories at all. Without --follow the git log code operates on generic pathspecs. With --follow, it only handles one file name. That's a limitation imposed by the algorithm Git is using here.

    [4] It could also have T, type-changed, and I think that would count. The full set of status letters is ABCDMRTUX but X indicates a bug in Git, U can only occur during an unfinished merge, B can only occur with git diff with the -B option, and C and R can only occur with the --find-copies and --find-renames (-C and -M) options enabled. Note that the git config documentation describes diff.renames as affecting porcelain commands such as git diff and git log, so both may enable --find-renames automatically based on that setting (it defaults to true since Git 2.9); lower-level commands such as git diff-files do not.


    The bugs in --follow

    This process, of removing some commits from the output display from git log, is called History Simplification. There is a long section in the documentation that describes this, and it begins with this rather odd claim:

    Sometimes you are only interested in parts of the history, for example the commits modifying a particular <path>. But there are two parts of History Simplification, one part is selecting the commits and the other is how to do it, as there are various strategies to simplify the history.

    What this weird phrasing—"one part is selecting the commits and the other is how to do it"—is trying to get at is that with history simplification enabled, git log sometimes doesn't even walk some commits. In particular, consider a merge commit, where two strings-of-commits come together:

              C--...--K
             /         \
    ...--A--B           M--N--O   <-- branch
             \         /
              D--...--L
    

    To show all commits, git log will have to walk commit O, then N, then M, then both K and L (in some order), then all the commits before K and all the commits before L going back to C and D, and then rejoin a single thread at commit B and keep going from there, backwards.

    If we're not going to show all commits, though, maybe—just maybe—at commit M, we could just go back to only commit K or only commit L and ignore the other "side" of the merge entirely. That will save a lot of work and time, and avoid showing you stuff that's irrelevant. This is usually a really good thing.
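The shortcut can be sketched with a toy version of the graph above. The rule modeled here is the one the History Simplification documentation gives for the default mode: if a merge is identical to one parent with respect to the path, follow only that parent. Everything else is an assumption for illustration: the "..." runs are dropped, the content values are invented, the breadth-first visiting order is not Git's ordering, and walk is a hypothetical name.

```python
# Toy graph from the diagram: M merges the top leg (K) and bottom leg (L).
parents = {
    "A": [], "B": ["A"],
    "C": ["B"], "K": ["C"],
    "D": ["B"], "L": ["D"],
    "M": ["K", "L"], "N": ["M"], "O": ["N"],
}
# Invented content of the one file of interest at each commit. Only the
# bottom leg (D, L) changes it, and the merge M keeps L's version.
content = {
    "A": "1", "B": "1", "C": "1", "K": "1",
    "D": "2", "L": "3", "M": "3", "N": "3", "O": "4",
}


def walk(tip, simplify):
    """Return the commits visited, starting from tip."""
    seen, queue, visited = set(), [tip], []
    while queue:
        commit = queue.pop(0)
        if commit in seen:
            continue
        seen.add(commit)
        visited.append(commit)
        next_parents = parents[commit]
        if simplify and len(next_parents) > 1:
            # At a merge, a parent with identical content for the path
            # explains the merge result, so only that parent is followed.
            same = [p for p in next_parents if content[p] == content[commit]]
            if same:
                next_parents = same[:1]
        queue.extend(next_parents)
    return visited


print(walk("O", simplify=False))
# ['O', 'N', 'M', 'K', 'L', 'C', 'D', 'B', 'A']
print(walk("O", simplify=True))
# ['O', 'N', 'M', 'L', 'D', 'B', 'A']
```

With simplification on, commits C and K are never visited at all, which is the saving described above.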

    For --follow, however, it's often a pretty bad thing. This is one of --follow's issues: sometimes Git will go down the "wrong leg" when doing this kind of simplification. Adding --full-history avoids this, but we immediately stumble into another problem. The --follow option has only one file name. If we have a rename in one of the two legs of the merge, but not in the other, and git log goes down the rename leg first, it may look for the wrong name when it goes down the other leg.

    If the file is renamed in both legs, so that it's renamed from M back to K and from M back to L, or if Git happens to go down the correct leg in the first place and you don't care about the other leg, everything works. But it's something to be aware of. (This is not the problem that's hitting you with --grep, or it would occur without --grep.)

    I think the bug you are seeing is that --grep is firing off "too early", as it were. The --grep option works by eliminating, from git log's output, any commit that has (--invert-grep) or lacks (--grep without --invert-grep) some particular text in its commit message. Suppose, then, that the rename commit—the one that causes git log --follow to know to use the name oldpath/my-file.php—gets skipped by your --grep option. Git won't see the R status, and won't know to change the name from newpath/my-file.php to oldpath/my-file.php. So git log --follow will keep looking for the new path, and you'll get only those commits that both meet the grep criteria and modify a file with the new name.
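The suspected ordering problem can be illustrated with a toy model in Python. This models only the hypothesis in the paragraph above, not Git's verified internals: the history is invented apart from the rename subject, renames are detected by exact content match, and log_follow_grep and its diff_before_filter switch are hypothetical.

```python
OLD = "oldpath/my-file.php"
NEW = "newpath/my-file.php"

# Linear toy history, oldest first; the rename is the newest commit.
history = [
    ("add file", {OLD: "v1"}),
    ("edit file", {OLD: "v2"}),
    ("renamed: oldpath/ -> newpath/", {NEW: "v2"}),
]


def log_follow_grep(history, path, exclude, diff_before_filter):
    """Follow path, hiding commits whose subject contains exclude."""
    shown = []
    for i in range(len(history) - 1, -1, -1):
        subject, new = history[i]
        old = history[i - 1][1] if i > 0 else {}
        hidden = exclude in subject
        if hidden and not diff_before_filter:
            continue  # filter fires first: no diff, so no rename is seen
        if new.get(path) == old.get(path):
            continue  # path untouched by this commit
        if not hidden:
            shown.append(subject)
        if path not in old:
            # Exact-content rename check, run even for hidden commits.
            for old_path, content in old.items():
                if old_path not in new and content == new[path]:
                    path = old_path
                    break
    return shown


print(log_follow_grep(history, NEW, "rename", diff_before_filter=False))
# []
print(log_follow_grep(history, NEW, "rename", diff_before_filter=True))
# ['edit file', 'add file']
```

The first call reproduces the "0 results" symptom; the second shows what running the diff anyway, even for a commit that will be hidden, would change in this model.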

    This bug could be fixed by having git log --follow run the diff engine anyway, even if it's going to skip the commit for other reasons. But more generally --follow needs a complete rewrite: it has a bunch of weird special case code threaded through the diff engine just to make this one case work. It needs to handle multiple path names and/or pathspecs, and work with --reverse and other options. It needs a way to stack old and new names onto commit paths, so that with --full-history, going down both legs of merges, it knows which path to be looking for. Note that this has other implications: what if, going down both legs of a merge, there are different renames? If there was a rename/rename conflict that someone fixed manually in the merge, how do we deal with that?