Tags: git, git-log, gitpython

git log does not return the history of a file correctly


I have a weird problem with the git log command. This command:

git log --pretty=format: --name-only --diff-filter=A

lists .xyz.yml in its output, but when I run this command:

git log --pretty="%ad" --diff-filter=A -- .xyz.yml

to retrieve the date this file was added to the repository, it returns nothing.

Is there any solution for it?

  • I am sure that I typed the exact file name in the command (so I believe it has nothing to do with case sensitivity)
  • the same problem happens in other repositories
  • I am using CentOS
  • I reproduced the problem both from Python (with the GitPython library) and from the git command line

I would be grateful for any kind of clue.

Edit:

When I try to get full history:

git log --full-history -- .xyz.yml

the output shows a brief and incomplete history:

commit b26d833b9da805d5d58c429a4af2d1a5c5b0bad9
Author: author name
Date:   Mon Dec 19 14:07:17 2016 -0500

Code config (#606)

* Create .xyz.yml

Created a Code config that uses your setup (also, enabled our Duplication engine).

* Update .xyz.yml

* Update .xyz.yml

* Update .xyz.yml

* Update .xyz.yml

* Update .xyz.yml

* Update .xyz.yml

* Update .xyz.yml

Even though the file is no longer present in the head, the history does not show any deletion...

I have also had a look at the commit history in the GitHub user interface and I see a whole different world there:

[Image: screenshot of the file's commit history in the GitHub user interface]

Even the first commit date is different from the one I get with --full-history.


Solution

  • (Note: if you don't already know how commits store full snapshots of every file, and link together through their metadata parent information, see, e.g., my answer here.)

    The questions before and after the edit are different, but they are related in terms of what's going on. The git log command can perform something Git calls History Simplification. Search the git log documentation for this two-word phrase, and you will find a section that begins with this strangely-worded paragraph:

    Sometimes you are only interested in parts of the history, for example the commits modifying a particular <path>. But there are two parts of History Simplification, one part is selecting the commits and the other is how to do it, as there are various strategies to simplify the history.

    Before we tackle the odd wording here, note that history simplification only occurs if you ask for it. The ways to ask for it are:

    • to use one of the explicit options described in this section, and/or
    • to list some path arguments, as in git log -- .xyz.yml: the .xyz.yml here is a path. (The -- is optional in some cases, and marks the remainder of arguments as paths. If the named paths exist in the current commit and do not resemble other git log options, the -- is not required. It's a good idea to get in the habit of using it always, though, so that you don't have to figure out whether it's required for this particular git log invocation.)

    Since in your troublesome case, you did use -- .xyz.yml, you did ask for History Simplification, even if you did not realize that you asked for it. That's why I added my comment; your reply that using --full-history fixed the problem proved that the default-mode simplification was in fact the problem.

    You then asked:

    full history returns introduction commit. Whats the reason?

    The answer lies in what the documentation calls Default mode:

    Simplifies the history to the simplest history explaining the final state of the tree. Simplest because it prunes some side branches if the end result is the same (i.e. merging branches with the same content)

    This is still rather inexplicable though. The initial paragraph talks about selecting the commits and then how to do it. I think what is missing here is that the documentation never talks about how git log really works.

    What we need to know—that the documentation fails to say—is that the way git log works is to scan through a queue of commits. This queue is a priority queue, i.e., a "higher priority" commit floats up to the front of the queue and gets examined first; a "lower priority" commit that is already in the queue gets pushed towards the back of the line by this higher priority commit. The git log command thus handles just one commit at a time out of this queue.

    The queue itself is loaded, initially, from any commits you specify on the command line. For instance, you can run:

    git log branch1 branch2 branch3
    

    This uses git rev-parse to turn each of branch1, branch2, and branch3 into a commit hash ID. The resulting three commit hash IDs—assuming we get three different ones—are loaded into the queue. If we get duplicates, the queue has two or even just one commit hash ID in it, at this point. For instance, if the names branch2 and branch3 select the same commit, while branch1 selects a different commit, the queue now has just two commits in it.

    (If you don't pick any starting commit, git log will use HEAD as the starting point commit. Its sister command, git rev-list, doesn't have this particular feature, so any time you use git rev-list instead of git log, make sure you give it an explicit starting point.)

    The git log code now enters its main loop. This loop:

    • takes the top-priority commit from the queue;
    • decides whether to print it, based on some git log arguments; and
    • decides whether to put its parent commit(s) into the queue, based on other git log arguments.
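The queue-driven loop described above can be sketched in Python. This is only a toy model, not Git's actual code: the names Commit, walk, should_print, and pick_parents are hypothetical, and using the commit date as the priority is an assumption made for the sketch.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    name: str            # stand-in for a commit hash ID
    date: int            # timestamp; a newer commit has higher priority
    parents: tuple = ()  # names of the parent commits


def walk(commits, starts,
         should_print=lambda c: True,
         pick_parents=lambda c: c.parents):
    """Yield commit names in the order this toy walk visits and prints them."""
    queue, seen = [], set()

    def push(name):
        if name not in seen:          # duplicate starting points collapse
            seen.add(name)
            heapq.heappush(queue, (-commits[name].date, name))

    for name in starts:               # seed the queue from the command line
        push(name)
    while queue:
        _, name = heapq.heappop(queue)       # take the top-priority commit
        commit = commits[name]
        if should_print(commit):             # decision 1: print it?
            yield name
        for parent in pick_parents(commit):  # decision 2: which parents to queue?
            push(parent)


history = {
    "F": Commit("F", 1),
    "G": Commit("G", 2, ("F",)),
    "H": Commit("H", 3, ("G",)),
}
order = list(walk(history, ["H", "H", "G"]))
print(order)
```

The two callbacks are where path arguments and history simplification come in: one decides what gets printed, the other decides which parents are ever visited.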

    When we ask git log to say things about a file like .xyz.yml, the decision about whether to print the commit has to compare the commit's snapshot to its parent's snapshot. We now want to scan down a bit in the documentation to this section:

    A more detailed explanation follows.

    Suppose you specified foo as the <paths>. We shall call commits that modify foo !TREESAME, and the rest TREESAME. (In a diff filtered for foo, they look different and equal, respectively.) [snip]

    (Read the rest and work through their example, too, either before or after reading the rest of this answer.)

    What Git is really going to do, internally, is take the snapshot for this commit—whatever it is—and strip away all files except those you listed. In this case, the one file you listed is .xyz.yml; in their example, the one file is named foo instead. But you can give a directory path here, and Git will strip away all files except those that are in that directory, or multiple paths, and Git will strip away all but those paths, too. This all works for the so-called TREESAME test. It's just easiest to understand when we're looking at one single file, because either the commit has some particular version of the file, or the commit lacks the file entirely: those are the only two possibilities. So two commits are "the same" (TREESAME) if both lack the file, or if both have the file and use the same version of the file.
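As a rough illustration of the stripping-and-comparing step, here is a minimal Python sketch. It assumes a snapshot can be modelled as a dictionary from file path to blob ID; the helper names restrict and treesame are hypothetical, not anything Git exposes.

```python
def restrict(snapshot, paths):
    """Keep only the snapshot entries that match one of the given paths.

    A path matches a file if it names the file exactly or names a
    directory that contains it.
    """
    def matches(file_path):
        return any(file_path == p or file_path.startswith(p.rstrip("/") + "/")
                   for p in paths)
    return {f: blob for f, blob in snapshot.items() if matches(f)}


def treesame(snap_a, snap_b, paths):
    """True if both snapshots agree on every file selected by the paths."""
    return restrict(snap_a, paths) == restrict(snap_b, paths)


without_file = {"README": "r1"}
with_v1 = {"README": "r2", ".xyz.yml": "v1"}
with_v2 = {"README": "r3", ".xyz.yml": "v2"}

both_lack = treesame(without_file, {"README": "r9"}, [".xyz.yml"])
same_version = treesame(with_v1, {**with_v1, "README": "r4"}, [".xyz.yml"])
different_version = treesame(with_v1, with_v2, [".xyz.yml"])
one_lacks = treesame(without_file, with_v1, [".xyz.yml"])
print(both_lack, same_version, different_version, one_lacks)
```

Changes to README never affect the result here, because README is stripped away before the comparison.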

    If we have a normal, everyday, non-merge commit with a single parent commit, this is all pretty straightforward. Consider the following simple chain of commits:

    ... <-F <-G <-H
    

    Here, commit H has some snapshot. H's parent, commit G, has some snapshot. G's parent F has some snapshot too, of course, and so on down the line. Probably each snapshot is different, but if we strip them down to just one file of interest—file foo, or file .xyz.yml—commit G and H may have the same file. G and H are TREESAME to each other. The copy in commit F, however, might be different: F and G are not TREESAME.

    What this means is that Git won't mention commit H. It has no change to the file. Git will mention commit G: it has a change to the file, as compared to its parent F. This is the first use of the TREESAME concept: by asking Git about particular files, it only prints commits that are not TREESAME to their parent commit: that at least one of the files we're asking about, changed.
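The F, G, H example can be replayed with a small Python sketch. It is a simplified model under two assumptions: the chain is cut off at F, which is then treated as a root commit compared against an empty tree, and only one file is tracked. The function name commits_to_print is hypothetical.

```python
def commits_to_print(chain, path):
    """chain: list of (name, snapshot) pairs, oldest first.

    Returns, newest first, the commits whose version of `path` differs
    from their parent's version (a missing file counts as None).
    """
    printed = []
    for i in range(len(chain) - 1, -1, -1):
        name, snapshot = chain[i]
        # The first entry has no parent in this model: compare with an empty tree.
        parent_snapshot = chain[i - 1][1] if i > 0 else {}
        if snapshot.get(path) != parent_snapshot.get(path):  # not TREESAME
            printed.append(name)
    return printed


chain = [
    ("F", {"foo": "v1", "other": "a"}),
    ("G", {"foo": "v2", "other": "a"}),  # changes foo
    ("H", {"foo": "v2", "other": "b"}),  # leaves foo alone
]
printed = commits_to_print(chain, "foo")
print(printed)
```

H is skipped because its copy of foo matches G's, while G is listed because its copy differs from F's.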

    Merge commits are tricky

    This only handles simple, ordinary commits like F, G, and H. What about merge commits? Our branch might have these commits in it:

           I--J
          /    \
    ...--H      M--N--...
          \    /
           K--L
    

    When Git is doing the TREESAME test for the (M, N) pair, that part is straightforward. Although M is a merge commit, it has a snapshot, just like any commit. So we reduce the snapshots in M and N to the file(s) of interest and decide whether the result is TREESAME. If so, we don't print N, and move on to M; if not, we do print N, and move on to M.

    Now we have to decide if commit M is TREESAME to its parent. But hang on, M does not have a parent. M has two parents, J and L. Which one should we compare?

    Git's answer is to compare all of them: to try a TREESAME(J, M) and a TREESAME(L, M). Git now knows whether M is TREESAME to all parents, or to some parents, or to no parents. If M is TREESAME to any parent, it is not printed; otherwise, it is printed. Now the real complication sets in.
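The printing rule for merges can be written down as a one-file Python sketch. The function name is_printed is hypothetical, file versions are modelled as blob IDs with None meaning "file absent", and the root-commit case (compare with an empty tree) is an assumption added so the function is complete.

```python
def is_printed(commit_version, parent_versions):
    """Default-mode print decision for one file of interest.

    A commit is printed only if its version of the file differs from
    the version in every one of its parents.
    """
    if not parent_versions:                 # root commit: compare with nothing
        return commit_version is not None
    return all(commit_version != p for p in parent_versions)


# Merge M lacks the file; parent J lacks it too, parent L has it.
merge_dropping_file = is_printed(None, [None, "v2"])
# A merge whose result differs from both sides.
merge_with_new_content = is_printed("v3", ["v1", "v2"])
# An ordinary single-parent commit that changes the file.
ordinary_change = is_printed("v2", ["v1"])
print(merge_dropping_file, merge_with_new_content, ordinary_change)
```

The first case is the one that matters for this question: a merge that matches even one parent stays silent.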

    At a merge commit, git log can put some or all parents into the queue

    Having printed or not printed commit M, git log must now decide:

    • Do I put commit J into the queue?
    • Do I put commit L into the queue?

    When not doing history simplification, Git will put both parents into the queue. (Well, not if you used the --first-parent option. But, since you didn't, we'll just ignore the option entirely.) But when doing history simplification, the rule is:

    • With --full-history, all parents go into the queue.
    • Without --full-history, pick one parent that is TREESAME (if several parents are TREESAME, Git still follows only one of them). If no parent is TREESAME, pick all parents. Put these into the queue.

    (Note that some merge commits might have 3 or more parents; the same rules apply to these many-parent merge commits. Here we only have a two-parent merge, so the phrase "all parents" means "both parents".)
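The parent-selection rule can be sketched in Python as follows. The function name parents_to_follow is hypothetical, and taking the first TREESAME parent in parent order is an assumption of this sketch; the rule above only says that a single TREESAME parent is followed.

```python
def parents_to_follow(commit_version, parents, full_history=False):
    """parents: list of (name, version of the file), in parent order.

    Returns the names of the parents that go into the queue.
    """
    if full_history:
        return [name for name, _ in parents]
    for name, version in parents:
        if version == commit_version:      # a TREESAME parent
            return [name]                  # follow only this one
    return [name for name, _ in parents]   # none is TREESAME: follow all


parents_of_M = [("J", None), ("L", "v2")]
default_choice = parents_to_follow(None, parents_of_M)
full_choice = parents_to_follow(None, parents_of_M, full_history=True)
no_match_choice = parents_to_follow("v3", [("J", "v1"), ("L", "v2")])
print(default_choice, full_choice, no_match_choice)
```

The same function handles merges with three or more parents, since it only iterates over whatever list it is given.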

    Now, suppose our file-of-interest is introduced in commit K or L. It's absent from commits H, I, and J, and—importantly—it's absent in M as well: the merge omits the file. Since commit M, after stripping away all but our one file-of-interest, is TREESAME to commit J after the same stripping-away, Git follows M back to J, completely ignoring commit L. (Note: it could be that our file is also absent from L, but for whatever reason, Git chooses to follow commit J instead of L as its single TREESAME commit, while doing history simplification.)

    In this case, the history simplification code completely stops git log from looking at the bottom row of commits. The queue never contains commit K, where the file is first introduced. A scan to find the file never finds it, because Git never peruses the history—the commit—in which the file is introduced.
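To see the whole effect end to end, here is a toy Python simulation of the graph above. It is a sketch, not Git's implementation: the dates, the version labels, and the function name log_path are invented for the example, and the first TREESAME parent is the one followed.

```python
import heapq

# name: (date, parents, version of .xyz.yml or None if absent)
GRAPH = {
    "H": (1, [], None),
    "I": (2, ["H"], None),
    "K": (3, ["H"], "v1"),       # file introduced here
    "J": (4, ["I"], None),
    "L": (5, ["K"], "v2"),       # file modified here
    "M": (6, ["J", "L"], None),  # the merge drops the file
    "N": (7, ["M"], None),
}


def log_path(start, full_history=False):
    """Return the commits a path-limited walk from `start` would print."""
    printed, seen = [], {start}
    queue = [(-GRAPH[start][0], start)]
    while queue:
        _, name = heapq.heappop(queue)
        _, parents, version = GRAPH[name]
        if parents:
            differs = all(version != GRAPH[p][2] for p in parents)
        else:
            differs = version is not None  # root: compare with an empty tree
        if differs:
            printed.append(name)
        follow = parents
        if not full_history:
            same = [p for p in parents if GRAPH[p][2] == version]
            if same:
                follow = same[:1]          # follow a single TREESAME parent
        for p in follow:
            if p not in seen:
                seen.add(p)
                heapq.heappush(queue, (-GRAPH[p][0], p))
    return printed


default_result = log_path("N")
full_result = log_path("N", full_history=True)
print(default_result, full_result)
```

In default mode the walk goes N, M, J, I, H and never reaches K or L, so nothing is printed. With full_history=True it prints L and K, but still not the merge M where the file disappears.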

    The goal of history simplification

    The idea behind this simplification is to explain why you have the files you have now. By not following the history from merge commit M back to commit L, in our example, we never find the file .xyz.yml. But that's what we want, because file .xyz.yml is not in the current commit N, or wherever it is that we started. We've asked Git to explain the files that are there. File .xyz.yml isn't there, and therefore the explanation as to why it isn't there is that it was never there in the history-of-interest: the history that explains why it's still not there.

    Your goal is different

    Your goal, of course, is to figure out where it was introduced and where it was lost. The fact is that it was lost at a merge, when someone decided: We don't need this stupid .xyz.yml in my merge result! Let's keep it out! That is, it was in some commit L, and isn't in its immediate successor merge M.

    The way I know this is from your final git log output, when you took out the --diff-filter option:

    git log --full-history -- .xyz.yml
    

    We see a commit that adds the file, and some commits that modify it, but we don't see any commit that deletes it. We do not see this commit because our merge M is TREESAME to at least one of its parents: J, in our example. So merge commit M is simply not printed.

    If it were printed, we'd still have a potential issue, because the way a merge is printed is a little funky. All commits have their hash ID and log message printed. If you ask for --name-status or --patch, you may also get the result of a git diff of some sort. For ordinary commits, this is a diff against the (single) parent commit. For merge commits, though, there's a problem:

    • if you don't ask for -c, --cc, or -m, git log lazily skips printing the diff entirely; and
    • if you do ask for -c or --cc, you get a combined diff.

    Combined diffs omit some files. In particular, they omit any file where at least one parent and the merge have the same version of the file—or in this case, both lack the file. So a combined diff won't mention that between L and M, the file got deleted. Only the -m style diff will mention the deletion here.

    The -m option does a "virtual split" of the merge commit. If merge X has parents P1, P2, P3, ..., Pn, you get n diffs: P1-vs-X, P2-vs-X, P3-vs-X, and so on up through Pn-vs-X. For our particular case, then, we would get two diffs: J vs M, and L vs M. The J-vs-M diff would show nothing at all for .xyz.yml, but the L-vs-M would show the deletion.
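The difference between a combined diff and the -m split can be modelled for a single file with a short Python sketch. The helper names status, split_diffs, and combined_diff_shows are hypothetical, and versions are again blob IDs with None meaning "file absent".

```python
def status(parent_version, merge_version):
    """Name-status letter for one file between a parent and the merge."""
    if parent_version == merge_version:
        return None   # no change against this parent
    if parent_version is None:
        return "A"
    if merge_version is None:
        return "D"
    return "M"


def split_diffs(merge_version, parents):
    """Model of -m: one (parent name, status) pair per parent."""
    return [(name, status(version, merge_version)) for name, version in parents]


def combined_diff_shows(merge_version, parents):
    """Model of -c / --cc: the file is listed only if it differs from every parent."""
    return all(version != merge_version for _, version in parents)


parents_of_M = [("J", None), ("L", "v2")]  # file absent in J, present in L
per_parent = split_diffs(None, parents_of_M)
in_combined = combined_diff_shows(None, parents_of_M)
print(per_parent, in_combined)
```

Only the per-parent view reports the "D" against L; the combined view drops the file because the merge matches J.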

    (Note that -m also modifies the way git log decides whether to print the merge at all: now that it's been split, it gets printed if it's not TREESAME to at least one parent. That's important too, here.)

    The bottom line

    If you're trying to figure out where some file got deleted, you may need git log --full-history --diff-filter=D -m -- path. This forces git log to go through all parents of each merge and to check whether the merge itself is the reason the file doesn't exist in the commit from which you're starting.
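Since the question mentions running Git from Python, here is a minimal sketch that wraps the command with the standard subprocess module. The function names deletion_log_command and find_deletions are hypothetical, and the extra --pretty=format:%H %ad and --name-status options are additions chosen for this sketch to make the output easier to parse; they are not part of the command given above.

```python
import subprocess


def deletion_log_command(path, rev="HEAD"):
    """Build the argument list for the deletion search; nothing is run here."""
    return ["git", "log", "--full-history", "--diff-filter=D", "-m",
            "--pretty=format:%H %ad", "--name-status",  # sketch-only additions
            rev, "--", path]


def find_deletions(repo_dir, path, rev="HEAD"):
    """Run the command inside repo_dir and return Git's raw output as text."""
    result = subprocess.run(deletion_log_command(path, rev), cwd=repo_dir,
                            capture_output=True, text=True, check=True)
    return result.stdout


command = deletion_log_command(".xyz.yml")
print(" ".join(command))
```

Calling find_deletions("/path/to/repo", ".xyz.yml") would run this in a real clone. With GitPython, passing the same arguments through repo.git.log(...) should behave the same way, though that is untested here.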