I am doing some source code analysis with GitPython. For each commit I look at the contents of each .java file like this:
    from git import Repo

    repo = Repo.init('/path/to/repo', bare=True)
    for commit in repo.iter_commits():
        for obj in commit.tree.traverse(predicate=lambda obj, depth: obj.type == 'blob' and obj.name.endswith('.java')):
            content = obj.data_stream.read().decode("CP437")
            # ...
This already works fast and reliably.
However, when I additionally try to get the number of changed and deleted lines of code for those files, it gets much slower. More precisely, I tried commit.stats.files, which internally calls git diff --numstat. This is basically exactly what I want (I can easily filter for .java files), but for a repo where the above code takes ~5s, adding commit.stats.files increases the time to ~140s, which is infeasible for larger repos.
So my question is: Do you have ideas for a clever and fast way of getting the diff lines of code for all .java files for all commits?
I do not need the full diffs, just the number of lines...
It would be nice not to increase the run time of the old code by more than a factor of 2.
Try running git log --numstat or git log --numstat > stats.txt in your terminal.
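If the running time is acceptable, make the same call through GitPython (Git is importable from the git package):
    git = Git(...); git.log('--numstat', ...)
If it is not acceptable, run the command as an external process instead (in Python, typically via the subprocess module ...) and parse its output yourself.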
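A minimal sketch of the external-process route, in case it helps: it runs git log --numstat once for the whole history and parses the output into per-commit line counts for .java files. The function names parse_numstat and java_numstat and the "@" commit marker are made up for this example. It assumes git is on the PATH and that file names contain no tabs or specially quoted characters. --no-renames is passed so that each path appears as a plain file name. Note that git log prints no numstat for merge commits by default, so the numbers for merges may differ from what commit.stats reports.

```python
import subprocess


def parse_numstat(log_text, suffix=".java"):
    """Parse `git log --numstat --format=@%H` output.

    Returns {commit_sha: {path: (added, deleted)}} for paths ending in suffix.
    The "@" prefix is a hypothetical marker chosen for this sketch; numstat
    lines start with a digit or "-", so the marker cannot be confused with them.
    """
    stats = {}
    current = None
    for line in log_text.splitlines():
        if line.startswith("@"):
            # Start of a new commit record.
            current = line[1:]
            stats[current] = {}
            continue
        if not line or current is None:
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            # Binary files are reported as "-", so there is nothing to count.
            continue
        if path.endswith(suffix):
            stats[current][path] = (int(added), int(deleted))
    return stats


def java_numstat(repo_path):
    """Run git once for the whole history and return the parsed statistics."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--no-renames", "--format=@%H"],
        check=True,
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    return parse_numstat(result.stdout)
```

The result of java_numstat('/path/to/repo') can then be looked up by commit.hexsha inside the existing repo.iter_commits() loop, so git is started only once instead of once per commit.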