git, github, github-api

Get the metadata for the first N commits in a remote Git repository


Using the following GitHub API endpoint, it is possible to get the metadata for the commits in a repository, ordered from the latest to the oldest:

https://api.github.com/repos/git/git/commits

Is there a way to obtain similar metadata but in the opposite (chronological) order, that is, starting with the oldest commits in the repository?

NOTE: I want to obtain such metadata without having to download the full repository.

Thanks


Solution

  • This is possible with a workaround that uses the GraphQL API. The method is essentially the same as the one for getting the first commit in a repo:

    First, request the latest commit and return totalCount and endCursor:

    {
      repository(name: "linux", owner: "torvalds") {
        ref(qualifiedName: "master") {
          target {
            ... on Commit {
              history(first: 1) {
                nodes {
                  message
                  committedDate
                  authoredDate
                  oid
                  author {
                    email
                    name
                  }
                }
                totalCount
                pageInfo {
                  endCursor
                }
              }
            }
          }
        }
      }
    }
    

    It returns something like this for the totalCount and pageInfo fields:

    "totalCount": 950329,
    "pageInfo": {
      "endCursor": "b961f8dc8976c091180839f4483d67b7c2ca2578 0"
    }
    

    I don't have a source for the cursor string format (b961f8dc8976c091180839f4483d67b7c2ca2578 0), but I've tested it on other repositories with more than 1000 commits and it always seems to be formatted like this:

    <static hash> <incremented_number>
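
A minimal sketch of how such a cursor could be taken apart and rebuilt, assuming the observed (undocumented) `<hash> <number>` layout holds. The helper names `split_cursor` and `join_cursor` are hypothetical and only serve as illustration:

```python
from typing import Tuple


def split_cursor(cursor: str) -> Tuple[str, int]:
    """Split '<hash> <number>' into its parts (observed format, not documented)."""
    oid, offset = cursor.rsplit(" ", 1)
    return oid, int(offset)


def join_cursor(oid: str, offset: int) -> str:
    """Rebuild a cursor string from a commit hash and a numeric offset."""
    return f"{oid} {offset}"


oid, offset = split_cursor("b961f8dc8976c091180839f4483d67b7c2ca2578 0")
print(oid, offset)
```

Since the format is inferred from experiments only, it may change without notice.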
    

    In order to iterate from the first commit to the newest, set the number in the cursor to totalCount - 1 - <number_perpage>*<page>, starting from page 1. Each page still lists its commits newest first, so reverse it if you want the oldest first.

    For example, to get the first 20 commits of the linux repository:

    {
      repository(name: "linux", owner: "torvalds") {
        ref(qualifiedName: "master") {
          target {
            ... on Commit {
              history(first: 20, after: "fc4f28bb3daf3265d6bc5f73b497306985bb23ab 950308") {
                nodes {
                  message
                  committedDate
                  authoredDate
                  oid
                  author {
                    email
                    name
                  }
                }
                totalCount
                pageInfo {
                  endCursor
                }
              }
            }
          }
        }
      }
    }
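
The number used in the cursor above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the newest commit sits at offset 0 and the oldest at totalCount - 1, which matches the sample response but is not documented. The function name `cursor_offset` is hypothetical:

```python
def cursor_offset(total_count: int, per_page: int, page: int) -> int:
    """Offset to put in the cursor so that `page` (1-based, counted from the
    oldest commit) is returned; follows totalCount - 1 - per_page * page."""
    return total_count - 1 - per_page * page


total_count = 950329  # totalCount from the sample response
print(cursor_offset(total_count, per_page=20, page=1))  # 950308
print(cursor_offset(total_count, per_page=20, page=2))  # 950288
```

With 950329 commits and 20 per page, page 1 gives 950308, the number used in the query above.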
    

    Note that the total commit count changes over time in this repo (and the hash in the cursor appears to change with it), so you need to fetch the current values before running the query.

    Here is an example iterating over the first 300 commits of the Linux repository (starting from the oldest):

    import requests
    
    token = "YOUR_ACCESS_TOKEN"
    
    name = "linux"
    owner = "torvalds"
    branch = "master"
    
    iteration = 3
    per_page = 100
    commits = []
    
    query = """
    query ($name: String!, $owner: String!, $branch: String!){
        repository(name: $name, owner: $owner) {
            ref(qualifiedName: $branch) {
                target {
                    ... on Commit {
                        history(first: %s, after: %s) {
                            nodes {
                                message
                                committedDate
                                authoredDate
                                oid
                                author {
                                    email
                                    name
                                }
                            }
                            totalCount
                            pageInfo {
                                endCursor
                            }
                        }
                    }
                }
            }
        }
    }
    """
    
    def getHistory(cursor):
        # cursor is inserted into the query text as-is: either null or a quoted string
        r = requests.post("https://api.github.com/graphql",
            headers = {
                "Authorization": f"Bearer {token}"
            },
            json = {
                "query": query % (per_page, cursor),
                "variables": {
                    "name": name,
                    "owner": owner,
                    "branch": branch
                }
            })
        r.raise_for_status()  # fail early on HTTP errors such as a bad token
        return r.json()["data"]["repository"]["ref"]["target"]["history"]
    
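    # The first request uses a null cursor; it is only needed to read
    # totalCount and the hash that prefixes the cursor.
    history = getHistory("null")
    totalCount = history["totalCount"]
    if totalCount > 1:
        cursor = history["pageInfo"]["endCursor"].split(" ")
        for i in range(1, iteration + 1):
            offset = totalCount - 1 - i * per_page
            if offset < 0:
                # fewer than i * per_page commits: stop rather than send a negative offset
                break
            cursor[1] = str(offset)
            history = getHistory(f"\"{' '.join(cursor)}\"")
            # each page arrives newest first, so reverse it before appending
            commits += history["nodes"][::-1]
    else:
        commits = history["nodes"]
    
    print(commits)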
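
The script above assumes the repository has at least iteration * per_page commits. As a hedged sketch of one way to walk the whole history oldest first, including a shorter final page, the page sizes and cursor offsets can be planned up front. The function `plan_pages` is hypothetical, makes no network calls, and relies on the same undocumented cursor numbering (newest commit at offset 0); an offset of `None` stands for a request without an `after` cursor:

```python
from typing import Iterator, Optional, Tuple


def plan_pages(total_count: int, per_page: int) -> Iterator[Tuple[int, Optional[int]]]:
    """Yield (first, offset) pairs, oldest page first.

    `first` is the page size to request and `offset` the number to place in
    the cursor; None means the request should be sent with a null cursor.
    """
    remaining = total_count  # commits not yet covered, counted from the newest
    while remaining > 0:
        first = min(per_page, remaining)
        start = remaining - first  # position of the newest commit on this page
        yield first, (start - 1 if start > 0 else None)
        remaining = start


print(list(plan_pages(250, 100)))
# [(100, 149), (100, 49), (50, None)]
```

Each pair would then be turned into a `history(first: ..., after: ...)` request as in the script above, reversing the nodes of every page before appending them.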