Search code examples
gitgithubgraphqlversion-control

What is the most efficient way to find common ancestors on a large remote GitHub repository?


Methods that I have explored:

So, there are dozens of ways to interact with a GitHub repository. Methods include Github GraphQL, GitHub REST API, git clone/fetch-ing, and interacting with the Github's underlying git services directly.

I have found various ways to form trees of commit SHAs, but almost every method I have found has inefficiencies, that I can't seem to resolve.

Goal:

I simply want to find the SHA of the common ancestor of 2 branches.

What I know:

Branches in git are like pointers to commit shas. The branch itself doesn't actually track what commits are on the branch, just what commit is at the tip of the branch. The commits then keep track of their parents. So, the conventional way to determine the common ancestor of 2 branches is to iterate over each parent commit starting from the branch tip, until you find a commit that is shared by both branches. This is how git merge-base works.

The problem with git merge-base is that you have to at a minimum perform a git fetch on the repository. This is inefficient, because git fetch downloads more than just the commit SHAs from the origin.

REST and GraphQL Problem:

So, I was thinking of using GitHub's REST and or GraphQL APIs. The problem with both:

  • neither have a built-in merge-base method (at least as far as I know), so I would have to iterate over the commit histories of each branch manually, which is not terrifically efficient, especially if I am using REST.

The particular repository I'm trying to run merge-base is chromium/chromium which is an absolutely massive repository.

Inefficiencies caused by large-scale repositories:

I am trying to find the common ancestors of the main branch and the feature-branches associated with each Git tag. Sounds pretty simple, except for the fact that there are almost 27,000 tags and associated feature-branches on the chromium/chromium repository.

So, with that said, trying to manually run a merge-base on the repository with a simple GraphQL Query is still far too inefficient. With a rate limit of 500,000 (5000 points), At 27,000 feature-branches, I could theoretically only iterate over 20 commits per branch before hitting the GraphQL rate limit. To make that worse, I only expect the amount of tags to increase on that repository over time.

REST is out of question (as far as I know anyway), because the rate limits for REST are even further restrictive.

Possible alternative solution:

I am unsure if there is a way to optimize the methods that I have explored above, but I have also considered interacting with the underlying communication services.

I believe git communicates at the Application Layer, using HTTP to communicate with repositories with custom HTTP headers. This method would allow me to interact, not only with Chromium's GitHub mirror, but the actual repository hosted by Piper/Gitiles.

However, I am unsure if there is a method to manually merge-base with that. I was wondering if it is possible to just download the commit SHAs (and their parents) given a target branch and no other metadata, but I would have no idea how to do that or if that would be efficient. I can't seem to find any good documentation on the underlying communication services offered by git repos.


Solution

  • This takes about one minute on my machine:

    git clone --bare --filter=tree:0 https://github.com/chromium/chromium chromium-bare
    

    Now the repository takes about 764M.

    --filter=tree:0 avoids trees and blobs (suggested by Benjamin W.). --bare means that we don’t have to check out a working tree.

    This takes 19 seconds:

    git commit-graph write
    

    Let’s try to find an old tag:

    git tag --sort=version:refname
    

    I don’t know how Chromium is versioned so I’m just guessing that these tags can be sorted in this way. The tag 3.0.195.25 is from 2014.

    This takes about 1.3s:

    git merge-base main 3.0.195.25
    

    Let’s say that that is a worst case. With about 27,000 tags it should take at most 1.3 * 27000 = 35100 seconds to compute all the merge bases (assuming that the runtime of the rest of the code is negligible), or 9¾ hours.

    Comparing xargs with (GNU) parallel

    Computing the merge base between master and two hundred tags takes about 2m18s:

    git for-each-ref --format='%(refname:short)' 'refs/tags/**' --count=200 | xargs -n 1 git merge-base master
    

    I warmed up the disk (hopefully) by running the command one time before the invocation.

    Same procedure with parallel instead:

    git for-each-ref --format='%(refname:short)' 'refs/tags/**' --count=200 | parallel -n 1 git merge-base master
    

    That takes 28s.

    My CPU has four cores and eight threads.

    That takes the total runtime down from 9¾ hours to approximately 1 hour and 10 minutes.