Search code examples
gitgit-loggit-blame

How to find all the "active" git commits in a tree?


I'd like to get a snapshot of "active" git commits has for a directory tree, meaning git commits that really are part of the build and not commits that have been fully superseded by newer commits.

I can do this by running git blame on every file and extracting the commits that way, but it's too slow to be practical on a large repo.


Solution

  • What git blame does is pretty much the only way to find the information you're looking for. However, you can simplify the action somewhat, and that might be enough for your purposes and perhaps that would be fast enough as well.

    Remember, every commit has a full snapshot of every file. A branch name identifies the last commit in some chain of commits. So when you have:

    ... <-F <-G <-H   <-- branch
    

    the name branch holds the raw hash ID of commit H. In commit H, there are many files, each of which has many lines. Those files are in the form they have in commit H, and that's all there is to it—except that commit H contains the hash ID of earlier commit G.

    You can use hash ID this to locate commit G and extract all of its files, and when the file in G completely matches the file in H, that means that—in git blame terms at least—all the lines in the file in G are attributable to G, if not to some earlier commit. So files that are different in G and H should be attributed to H. The git blame command works on a line-by-line basis, attributing individual lines to commit H if they differ, but perhaps for your purposes, attributing the entire file to H suffices.

    Should you decide that the file should perhaps be attributed to commit G, it is now time to extract commit F's hash ID from commit G, and use that to read all the files from commit F. If any given file in F matches the copy in G, the attribution moves back to F; otherwise it remains at G.

    You must repeat this process until you run entirely out of commits:

    A <-B <-C ... <-H
    

    Since commit A has no parent, any files in A that are unchanged all the way through the last commit are to be attributed to commit A. You can, however, stop traversing backwards as soon as you have completely attributed all files that exist in H to some commit later in the chain. Compare this to git blame, which must keep looking backwards as long as at least one line is attributed to some earlier commit: you'll probably stop long before git blame must.

    Moreover, because of Git's internal data structures, it is very fast to tell whether a file in some earlier commit exactly matches a file of the same name in some later one: every file in every commit is represented by a hash ID. If the hash ID is the same, the file's contents are bit-for-bit identical in the two commits. If not, they're not.

    There is no convenient in-Git command to do exactly what you want,1 and if you do intend to traverse the history like this, you must decide what to do with merges. Remember that a merge commit has a snapshot, but unlike a non-merge, has two or more parents:

    ...--o--K
             \
              M--o--o--...--o   <-- last
             /
    ...--o--L
    

    Which commit(s) should you follow, if the file in M matches one or more of the files in K and/or L? The git log command has its own method of doing this—git log <start-point> -- <path> will simplify history by following one parent, chosen at random from the set of such parents, that has the same hash ID for the given file.

    Note that you can use git rev-list, perhaps with --parents, to produce the set of hash IDs that you can choose to examine. The rev-list command is the workhorse for most other Git commands, including git blame itself, for following history like this. (Note: the git log command is built from the same source as git rev-list, with some minor command-line-option differences and different default outputs.)


    1While git log <start-point> -- <path> is useful here, it will be too slow to run this once for each path, and it's not effective to run it without giving individual paths.