Tags: python, git, optimization, diff, gitpython

Fast way to get number of changed/deleted lines per subset of files per commit


I am doing some source code analysis with GitPython. For each commit I look at the contents of each .java file like this:

from git import Repo

# Repo.init would create/reinitialize a repository; Repo() opens an existing one.
repo = Repo('/path/to/repo')
for commit in repo.iter_commits():
    # Yield only the blobs whose file name ends in .java.
    for obj in commit.tree.traverse(predicate=lambda obj, depth: obj.type == 'blob' and obj.name.endswith('.java')):
        content = obj.data_stream.read().decode("CP437")
        #...

This already runs fast and reliably. However, when I additionally try to get the number of changed and deleted lines of code for those files, it becomes much slower. More precisely, I tried commit.stats.files, which internally runs git diff --numstat. That is basically exactly what I want (I can easily filter for .java files), but for a repo where the above code takes ~5 s, adding commit.stats.files increases the run time to ~140 s, which is infeasible for larger repos.
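For reference, the slow variant I measured looks roughly like this (a sketch; the loop body just stands in for whatever per-file processing follows):

from git import Repo

repo = Repo('/path/to/repo')
for commit in repo.iter_commits():
    # commit.stats.files maps each touched path to its insertion/deletion
    # counts; computing it effectively runs one git diff per commit.
    for path, counts in commit.stats.files.items():
        if path.endswith('.java'):
            insertions, deletions = counts['insertions'], counts['deletions']
            #...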

So my question is: do you have ideas for a clever and fast way of getting the diff line counts for all .java files across all commits?
I do not need the full diffs, just the number of lines.

Ideally, the run time of the old code would increase by no more than a factor of 2.


Solution

  • Try running git log --numstat (or git log --numstat > stats.txt) in your terminal.

    If the running time is acceptable:

    • Check whether running the same command through GitPython performs acceptably too, e.g. repo.git.log('--numstat', ...). If not, run the command with Python's subprocess module instead.
    • Parse the output of this command, as in the sketch below.
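A minimal sketch of that approach, assuming GitPython's repo.git command passthrough; the --COMMIT-- delimiter and the layout of the stats dict are arbitrary choices made here, not part of any API:

from collections import defaultdict
from git import Repo

repo = Repo('/path/to/repo')

# A single git process covers the whole history; %H prints the commit hash,
# and the SEP marker lets us find commit boundaries in the combined output.
SEP = '--COMMIT--'
out = repo.git.log(f'--pretty=format:{SEP}%H', '--numstat')

stats = defaultdict(dict)  # sha -> {path: (insertions, deletions)}
sha = None
for line in out.splitlines():
    if line.startswith(SEP):
        sha = line[len(SEP):]
    elif line and sha:
        added, deleted, path = line.split('\t', 2)
        if path.endswith('.java'):
            # --numstat prints '-' instead of counts for binary files
            stats[sha][path] = (
                int(added) if added != '-' else 0,
                int(deleted) if deleted != '-' else 0,
            )

Because the whole history is handled by one git invocation instead of one diff per commit, this avoids the per-commit process overhead that makes commit.stats.files slow.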