Tags: python, pandas, github, github-api

Get license link from GitHub with a specific commit hash


I have a table (as a pandas DataFrame) of (mostly) GitHub repos, for which I need to automatically extract the LICENSE link. However, it is a requirement that the link does not simply point to /blob/master/ but to a specific commit, because the master version might change at some point. I assembled a Python script to do this through the GitHub API, but via the API I am only able to retrieve the link with the master tag.

I.e. instead of
https://github.com/jsdom/abab/blob/master/LICENSE.md
I want
https://github.com/jsdom/abab/blob/8abc2aa5b1378e59d61dee1face7341a155d5805/LICENSE.md

Any idea if there is a way to automatically get the link to the latest commit for a file, in this case the LICENSE file?

This is the code I have written so far:

import json
import re

import requests

def githubcrawl(repo_url, session, headers):
    parts = repo_url.split("/")[3:]  # owner/repo taken from the repo URL
    url_tmpl = "https://api.github.com/repos/{}/license"
    url = url_tmpl.format("/".join(parts))
    try:
        response = session.get(url, headers=headers)
        if response.status_code == 404:
            return f"404: {repo_url}"
        else:
            data = json.loads(response.text)
            return data["html_url"]  # Returns the HTML URL of the LICENSE file
    except requests.exceptions.RequestException as e:
        print(repo_url, "-", e)
        return f"http_error: {repo_url}"

token="mytoken" # Token for github authentication to get more requests per hour
headers={"Authorization": "token %s" % token}

session = requests.Session()
lizlinks = [] # List to store the links of the LICENSE files in

# iterate over DataFrame of applications/deps
for idx, row in df.iterrows():
#    if idx < 5:
        if type(row["Homepage"]) == type("str"):
            repo_url = re.sub(r"\#readme", "", row["Homepage"])
            response = session.get(repo_url, headers=headers) 
            repo_url = response.url # Some URLs are just redirects, so I get the actual repo url here
            if "github" in repo_url and len(repo_url.split("/")) >= 3:
                link = githubcrawl(repo_url, session, headers)
                print(link)
                lizlinks.append(link)
            else:
                print(row["Homepage"], "Not a github Repo")
                lizlinks.append("Not a github repo")
        else:
            print(row["Homepage"], "Not a github Repo")
            lizlinks.append("Not a github repo")

Bonus question: would parallelizing this task work with the GitHub API? I.e., could I send multiple requests at once without being rate-limited or locked out, or is the for loop a good approach to avoid that? Going through the roughly 1,000 repos in my list takes quite a while.


Solution

  • OK, I found a way to get the unique SHA hash of the current commit. I believe that should always link to the license file as it was at that point in time.

    Using the Python git library (GitPython), I simply run the git ls-remote command and return the SHA of HEAD:

    def lsremote_HEAD(url):
        # Runs `git ls-remote <url>` without needing a local clone;
        # the first token of the output is the SHA that HEAD points to.
        g = git.cmd.Git()
        HEAD_sha = g.ls_remote(url).split()[0]
        return HEAD_sha
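
    For example, calling it on the repo from the question returns whatever SHA HEAD points to at that moment:

    print(lsremote_HEAD("https://github.com/jsdom/abab"))
    # e.g. 8abc2aa5b1378e59d61dee1face7341a155d5805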
    

    I can then replace "master", "main", or whatever branch name appears in the blob URL inside my githubcrawl function:

    token="token_string"
    headers={"Authorization": "token %s" % token}
    session = requests.Session()
    def githubcrawl(repo_url, session, headers):
        parts = repo_url.split("/")[3:]
        api_url_tmpl = "http://api.github.com/repos/{}/license"
        api_url = api_url_tmpl.format("/".join(parts))
        try:
            print(api_url)
            response = session.get(api_url, headers=headers)
            if response.status_code in [404]:
                return(f"404: {repo_url}")
            else:
                data = json.loads(response.text)
                commit_link = re.sub(r"/blob/.+?/",rf"/blob/{lsremote_HEAD(repo_url)}/", data["html_url"])
                return(commit_link)
        except urllib.error.HTTPError as e:
            print(repo_url, "-", e)
            return f"http_error: {repo_url}"
    

    Maybe this helps someone, so I'm posting this answer here.

    This answer uses the following libraries:

    import re
    import git
    import json
    import requests
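
    Regarding the bonus question: yes, this can be parallelized, but keep the concurrency modest. With a token, the GitHub REST API allows 5,000 requests per hour, and GitHub additionally applies secondary rate limits to bursts of concurrent requests, so a small thread pool is safer than firing off everything at once. Below is a minimal sketch, assuming the df, headers and githubcrawl defined above; the pool size of 8 is an arbitrary conservative choice, and the redirect-resolution step from the question's loop is omitted for brevity.

    from concurrent.futures import ThreadPoolExecutor

    def crawl_one(url):
        # One Session per call: requests.Session is not documented as
        # thread-safe, so avoid sharing one session across threads.
        with requests.Session() as s:
            return githubcrawl(url, s, headers)

    urls = [h for h in df["Homepage"] if isinstance(h, str)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        lizlinks = list(pool.map(crawl_one, urls))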