I am trying to extract git logs from a few repositories like this:
git log --pretty=format:'%H%x09%ae%x09%an%x09%at%x09%s' --numstat
For larger repositories (like rails/rails) it takes a solid 35+ seconds to generate the log.
Is there a way to improve this performance?
You are correct: it does take somewhere between 20 and 35 seconds to generate the report on 56'000 commits, producing 224'000 lines (15 MiB) of output. I actually think that's pretty decent performance, but you don't; okay.
Because you are generating a report in a fixed format from a database that does not change (the repository's existing history), you only have to do it once. Afterwards, you can reuse the cached result of git log
and skip the time-consuming generation. For example:
git log --pretty=format:'%H%x09%ae%x09%an%x09%at%x09%s' --numstat > log-pretty.txt
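Of course the history does grow, so in practice you want the cache to refresh itself when new commits arrive. A minimal sketch of that idea, assuming the cache file names log-pretty.txt and log-pretty.head (both my invention, not part of any tool), keyed on the current HEAD:

```shell
# Regenerate the cached report only when HEAD has moved since the last run.
# Run this inside the repository you are reporting on.
head=$(git rev-parse HEAD)
if [ ! -f log-pretty.txt ] || [ "$head" != "$(cat log-pretty.head 2>/dev/null)" ]; then
    # %x09 is git's placeholder for a literal tab character
    git log --pretty=format:'%H%x09%ae%x09%an%x09%at%x09%s' --numstat > log-pretty.txt
    echo "$head" > log-pretty.head
fi
```

On an unchanged repository the expensive git log is skipped entirely and only the cheap git rev-parse runs.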
You might wonder how long it takes to search that entire report for data of interest. That's a worthy question:
$ tail -1 log-pretty.txt
30 0 railties/test/webrick_dispatcher_test.rb
$ time grep railties/test/webrick_dispatcher_test.rb log-pretty.txt
…
30 0 railties/test/webrick_dispatcher_test.rb
real 0m0.012s
…
Not bad. Introducing a "cache" has cut the time needed from 35+ seconds to about a dozen milliseconds, almost 3000 times as fast.
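And because the cached report is tab-separated, you can answer richer questions than a plain grep, still without re-running git log. For example (assuming subjects contain no tabs, so commit header lines have exactly five tab-separated fields while --numstat lines have three), commits per author email:

```shell
# Count commits per author email from the cached report.
# NF == 5 selects the commit header lines; numstat lines have 3 fields.
awk -F'\t' 'NF == 5 { count[$2]++ } END { for (a in count) print count[a], a }' \
    log-pretty.txt | sort -rn | head
```

The same pattern (match on field count, aggregate in an awk array) works for lines added per file, commits per day, and so on.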