Tags: git, git-log, git-blame

To display changes to the first line of a csv file tracked by git, can git log be on one line when using the -L line argument?


I would like to show only changes to the column headers of a CSV file tracked by git. I use the code in this nice answer by Kirill Müller. It works almost perfectly, except that it repeats a line even when the commit did not actually change the first line of the file.

Reproducible code

cd /tmp/
mkdir test
cd test/
git init
echo "bla,bla" > table.csv
git add table.csv
git commit -m "version bla"
echo "bla,bli" > table.csv
git commit -am "version bli"
echo "1,2" >> table.csv
git commit -am "Add data"

Issue

user:/tmp/test$ FILE=table.csv
user:/tmp/test$ LINE=1
user:/tmp/test$ git log --format=format:%H $FILE | xargs -L 1 git blame $FILE -L $LINE,$LINE
e4a89a75 (user 2022-08-10 16:45:04 +0200 1) bla,bli
e4a89a75 (user 2022-08-10 16:45:04 +0200 1) bla,bli
^58b4b88 (user 2022-08-10 16:44:16 +0200 1) bla,bla

The issue is that commit e4a89a75 appears twice, even though the last commit ("Add data") did not change the first line.

Expected output

e4a89a75 (user 2022-08-10 16:45:04 +0200 1) bla,bli
^58b4b88 (user 2022-08-10 16:44:16 +0200 1) bla,bla

What I tried

The log part of the command currently uses --format=format:%H:

user:/tmp/test$ git log --format=format:%H table.csv
c51873404aa45fb50fcbd6bd7ea06ab1e9f22071
e4a89a75e48623a1d2967996e6de3a250607e6a5
58b4b88800dd57cb1ca0476f1b9939781af28600

I tried adding the -L1,1:table.csv argument to the log command, but it formats the log differently, so the output can no longer be used as input to xargs:

user:/tmp/test$ git log --format=format:%H -L1,1:table.csv
e4a89a75e48623a1d2967996e6de3a250607e6a5
diff --git a/table.csv b/table.csv
--- a/table.csv
+++ b/table.csv
@@ -1,1 +1,1 @@
-bla,bla
+bla,bli

58b4b88800dd57cb1ca0476f1b9939781af28600
diff --git a/table.csv b/table.csv
--- /dev/null
+++ b/table.csv
@@ -0,0 +1,1 @@
+bla,bla

Putting the log on one line may not be possible when using -L according to this answer:

"[...] git log --oneline -L 10,11:example.txt does work (it does however output the full patch)."


Solution

  • (First, big thanks for the reproducer—it was helpful—but one note: watch out, your quotes got mangled into "smart quotes" instead of plain double quotes. I fixed them.)

    I would like to show only changes to the column headers of a csv file tracked by git.

    Based on the example, by "column headers" I take it you mean "line 1".

    The basic problem starts here:

    git log --format=format:%H $FILE | ...
    

    This finds, and prints the hash ID of, every commit that changes anything in the given file. (FILE needs to be set to table.csv here.) This is not at all what you want! Its only function is to skip any commit in which the file is entirely unchanged (which could be useful in real-world examples, but not so much in your reproducer, since every commit changes the file here).

    (Side note: whenever it's possible, use git rev-list instead of git log. It's possible here. However, we're going to end up discarding git log / git rev-list anyway. But see footnote / separate section below.)

    ... | xargs -L 1 git blame $FILE -L $LINE,$LINE
    

    (Here, LINE needs to be set to 1.) The general idea here seems to be to run git blame on one specific line (in this case line 1), which is fine as far as it goes, but isn't really what we want. If our left-side command, git log ... $FILE, had selected just the revisions we want, we would already have our answer and could just stop here.

    The real trick here is to run git blame repeatedly but only until the blame "runs out". Each invocation of git blame should tell us who / which commit is "responsible for" (i.e., produced this version of) the given line, and that's exactly what git blame does. You give it a starting (ending?—Git works backwards, so we start at the end and work backwards) revision, and Git checks that version and the previous commit to see if the line in question changed in that version. If so, we're done: we print that version and the line. If not, we put the previous version in place and repeat. We do this until we run out of "previous versions", in which case we just print this version and stop.

    So git blame is already doing what you want. The only problem is that it stops after it finds the "previous version" to print. So what we really want is to build a loop:

    do {
        rev, other-info, output = <what git blame does>
        print rev and/or output in appropriate format
    } while other-info says there are previous revs
    

    The way to deal with this is to use --porcelain (or --incremental but --porcelain seems most appropriate here). We know that -L 1,1 (or -L $LINE,$LINE) is going to output a single line at the end. We want to collect the remaining lines. The output from --porcelain is described in the documentation: it's a series of lines with, in our case, the first and last being of interest, and the middle ones might be interesting, or might not, except that previous or boundary is always of interest.
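    To make the shape of the --porcelain output concrete, here is a minimal sketch of a parser for the single-line case. The names BlameEntry and parse_porcelain are made up for this illustration, and the sketch assumes exactly one blamed line (as with -L 1,1): one header line starting with the commit hash, then "key value" lines, then the tab-prefixed content line.

```python
from typing import Dict, NamedTuple, Optional


class BlameEntry(NamedTuple):
    """Hypothetical container for one blamed line (not part of Git)."""

    rev: str  # commit that last touched the line
    headers: Dict[str, Optional[str]]  # e.g. "author", "previous", "boundary"
    content: Optional[str]  # the blamed line, without its leading tab


def parse_porcelain(text: str) -> Optional[BlameEntry]:
    """Parse `git blame --porcelain -L n,n` output for a single line.

    Returns None when there is no output at all, which is what happens
    when git blame fails (bad path, bad line number, and so on).
    """
    lines = text.split("\n")
    first = lines[0].split()
    if not first:
        return None
    headers: Dict[str, Optional[str]] = {}
    content: Optional[str] = None
    for line in lines[1:]:
        if not line:
            continue  # trailing empty piece after the final newline
        if line.startswith("\t"):
            content = line[1:]  # the file line itself
        else:
            # "previous <rev> <path>" has a value; "boundary" has none
            key, _, value = line.partition(" ")
            headers[key] = value or None
    return BlameEntry(first[0], headers, content)
```

    A caller would then look at entry.headers for "previous" or "boundary" to decide whether to keep walking backwards.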

    Shell parsing is kind of messy, so it's probably best to use some other language to handle the output from git blame. For instance, we might use a small Python program. This one doesn't have many features but shows how to use --porcelain here, and should be easy to modify. It has been very lightly tested (and run through black for formatting and mypy for type checking), but it definitely needs better error handling. For instance, running it with a nonexistent pathname gets you a fatal error message from Git, followed by a Python traceback. I leave the cleanup to someone else at this point.

    #! /usr/bin/env python3
    
    """
    Analyze "git blame" output and repeat until we reach the boundary.
    """
    
    import argparse
    import subprocess
    import sys
    
    
    def blame(path: str, args: argparse.Namespace) -> None:
        rev = "HEAD"
        while True:
            cmd = [
                "git",
                "blame",
                "--porcelain",
                f"-L{args.line},{args.line}",
                rev,
                "--",
                path,
            ]
            # if args.debug:
            #    print(cmd)
            proc = subprocess.Popen(
                cmd, shell=False, universal_newlines=True, stdout=subprocess.PIPE,
            )
            assert proc.stdout is not None
            info = proc.stdout.readline().split()
            rev = info[0]
            kws = {}
            match = None
            for line in proc.stdout:
                line = line.rstrip("\n")
                if line.startswith("\t"):
                    # here's our match, there won't be anything else
                    match = line
                else:
                    parts = line.split(" ", 1)
                    kws[parts[0]] = parts[1] if len(parts) > 1 else None
            status = proc.wait()
            if status != 0:
                # git failed, so whatever was parsed above cannot be trusted
                print(f"'{' '.join(cmd)}' returned {status}", file=sys.stderr)
                return
    
            # found something useful
            print(f"{rev}: {match}")
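            if "boundary" in kws:
                break
            # No "previous" header means there is no older version to follow
            # (for instance, the file was first added in this commit).
            prev = kws.get("previous")
            if prev is None:
                break
            parts = prev.split(" ", 1)
            assert len(parts) == 2
            rev = parts[0]
            path = parts[1]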
    
    
    def main() -> int:
        parser = argparse.ArgumentParser(description=__doc__)
        parser.add_argument("--line", "-l", type=int, default=1)
        parser.add_argument("files", nargs="+")
        args = parser.parse_args()
        for path in args.files:
            blame(path, args)
        return 0
    
    
    if __name__ == "__main__":
        try:
            sys.exit(main())
        except KeyboardInterrupt:
            sys.exit("\nInterrupted")
    

    [Edit: this program badly needs a few checks for when Git doesn't run or git blame does not find the file or line. In particular proc.stdout.readline() gets end-of-file and returns an empty string. Use with caution, fix it up, or don't use it at all.]
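    As a rough sketch of the kind of check meant here, the blame invocation could be wrapped so that "Git did not start", "Git failed", and "Git printed nothing" all turn into None instead of a traceback. The function name blame_once and its run parameter are made up for this example; run defaults to subprocess.run and exists only so the function can be exercised without a repository.

```python
import subprocess
from typing import Any, Callable, List, Optional


def blame_once(
    rev: str,
    path: str,
    line: int,
    run: Callable[..., Any] = subprocess.run,
) -> Optional[str]:
    """Return porcelain blame output for one line, or None on any failure."""
    cmd: List[str] = [
        "git",
        "blame",
        "--porcelain",
        f"-L{line},{line}",
        rev,
        "--",
        path,
    ]
    try:
        # stderr is left alone, so git's own error message still reaches the user
        proc = run(cmd, stdout=subprocess.PIPE, text=True)
    except OSError:
        return None  # the git executable could not be started
    if proc.returncode != 0 or not proc.stdout.strip():
        return None  # bad path, bad line number, not a repository, ...
    return proc.stdout
```

    The caller can then stop the loop cleanly when it gets None rather than indexing into an empty list.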

    Using git log directly

    This may not have the output format you want, but note that git log can do just what you want without having to write a bunch of new code:

    git log --oneline -L1,1:table.csv
    

    (or leave out the --oneline if you like). The -L directive takes two line numbers and a file name, or one of several other forms, and does the same "find commits that modify the file" search that you were using git log table.csv for in the first place, but restricts the output still further, to show only those commits where the specified lines change.

    Add --no-patch and an appropriate set of format directives, and you can get the commit hash IDs and whatever else you like, and then use some program to extract the lines from the specific files (e.g., git cat-file -p rev:path | sed -n -e "$line{p;q;}").
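    A minimal sketch of that pipeline in Python might look like the following. The helper names (log_cmd, extract_hashes, nth_line, line_history) are invented for this example. It assumes the path is given relative to the repository root and that the line keeps the same number and path in every reported commit, which holds for a header on line 1 but not in general. The hash filter is there in case the patch text is still printed, as in the output shown in the question.

```python
import re
import subprocess
from typing import List, Optional

# Full commit IDs only: 40 hex digits (SHA-1) up to 64 (SHA-256).
HASH_RE = re.compile(r"[0-9a-f]{40,64}")


def log_cmd(line: int, path: str) -> List[str]:
    """Build the git log command that tracks one line of one file."""
    return ["git", "log", "--no-patch", "--format=%H", f"-L{line},{line}:{path}"]


def extract_hashes(log_output: str) -> List[str]:
    """Keep only lines that are a bare commit ID, dropping any patch text."""
    return [ln for ln in log_output.split("\n") if HASH_RE.fullmatch(ln)]


def nth_line(text: str, n: int) -> Optional[str]:
    """Return line n (1-based) of text, or None if there is no such line."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    return lines[n - 1] if 1 <= n <= len(lines) else None


def line_history(line: int, path: str) -> None:
    """Print '<short hash> <line text>' for each commit that changed the line."""
    log = subprocess.run(
        log_cmd(line, path), stdout=subprocess.PIPE, text=True, check=True
    ).stdout
    for rev in extract_hashes(log):
        blob = subprocess.run(
            ["git", "cat-file", "-p", f"{rev}:{path}"],
            stdout=subprocess.PIPE,
            text=True,
            check=True,
        ).stdout
        print(f"{rev[:8]} {nth_line(blob, line)}")
```

    Calling line_history(1, "table.csv") inside the repository would be the equivalent of the git cat-file and sed combination above.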

    Note that git log is what Git calls a porcelain command (vs git rev-list or git blame --porcelain acting as what Git calls a plumbing command). Porcelain commands generally obey Git configurations, such as the settings for color.ui, core.pager, and log.pager, and settings like log.decorate. This makes them hard to use from other programs, as it's hard to know whether something will be colorized (with ESC [ 31 m sequences for instance). Plumbing programs behave in a well-defined manner so that other programs can know exactly what input to expect. This is why we normally want to use git rev-list rather than git log when writing scripts, if we're doing something that both commands can do.
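    To illustrate the difference from a script's point of view, here is a small sketch of the two command lines a Python program might build. The function names are made up for this example. The first uses the plumbing command and needs no precautions; the second is one possible defensive way to call git log, switching off the pager, color, and decorations explicitly. It is not an exhaustive list of settings that can affect git log output.

```python
from typing import List


def rev_list_cmd(path: str, rev: str = "HEAD") -> List[str]:
    """Plumbing: list commits touching path, one full hash per line."""
    return ["git", "rev-list", rev, "--", path]


def plain_log_cmd(path: str, rev: str = "HEAD") -> List[str]:
    """Porcelain, with user-configurable presentation turned off explicitly."""
    return [
        "git",
        "--no-pager",  # never hand the output to a pager
        "-c",
        "color.ui=false",  # no ESC [ ... m color sequences
        "log",
        "--no-decorate",  # override log.decorate
        "--format=%H",
        rev,
        "--",
        path,
    ]
```

    Either list can be passed to subprocess.run; the first is shorter precisely because the plumbing command has fewer configuration knobs to guard against.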