libgit2

Best way to read a repository multiple times at the same time using libgit2, performance/memory wise?


I'm using libgit2, and I'd like to read two different branches at the same time and put their commits in two different lists.

I'm worried about the performance/memory consumption in that case.

Quoting the git_revwalk_new API page on the libgit2 website:

This revision walker uses a custom memory pool and an internal commit cache, so it is relatively expensive to allocate.

For maximum performance, this revision walker should be reused for different walks.

This revision walker is not thread safe: it may only be used to walk a repository on a single thread; however, it is possible to have several revision walkers in several different threads walking the same repository.

My initial approach was to use two walkers for the two lists, each on a different thread. They would walk the repository at the same time, and each walker would include the commits for the branch targeted by its list.

However, from my understanding of the quote above (please correct me if I'm wrong), allocating a new RevisionWalker is expensive, so it might consume a lot of memory on large repositories. Also, the walker uses its internal cache, so re-reading and looking up commits should be faster if the same walker is reused for multiple walks.

So my second thought was to use only one walker, synchronously, on a single thread: I include the commits of the first branch, read them, and put them in the first list. Then I reset the walker, include the commits of the second branch, read the repository commits again, and put them in the second list.

I tried both approaches. Memory-wise, there wasn't much difference: both used approximately the same amount of memory. Performance-wise, there wasn't much difference either.

So what do you recommend in this case? Or is there a better solution?


Solution

  • A git_revwalk shouldn't be that expensive to build, realistically. And it sounds like you've evaluated the two options and determined that there's really no functional difference. So I would encourage you to use whichever one is code-wise the simplest, easiest to maintain, and easiest to reason about.

    Usually that's the one without threads, but that's just a guess.

    The other advantage to doing two calls serially is that you may have a case where you can optimize the second revision walk based on the results of the first. For example, if you don't want to work on commits in the second branch that you already saw in the first, or you can optimize away whatever you're doing with them based on the knowledge that they were in the first branch, then hiding those commits with git_revwalk_hide may make the overall computation more efficient.