python github github-api github-search

How do I get all 1000 results using the GitHub Search API?


I understand that the GitHub Search API returns at most 1000 results per query and at most 100 results per page. I therefore wrote the following to page through all 1000 results of a code search for the string torch:

import requests
for i in range(1,11):
    url = "https://api.github.com/search/code?q=torch +in:file + language:python&per_page=100&page="+str(i)

    headers = {
    'Authorization': 'xxxxxxxx'
    }

    response = requests.request("GET", url, headers=headers).json()
    try:
        print(len(response['items']))
    except:
        print("response = ", response)

Here is the output:

15
62
response =  {'documentation_url': 'https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits', 'message': 'You have exceeded a secondary rate limit. Please wait a few minutes before you try again.'}
  1. It seems to hit the secondary rate limit right after the second iteration.
  2. The number of results per page isn't consistent. For instance, page 1 returned 15 results on this run, but it returns a different number each time I run it. I expected 100 results per page.

Is there an efficient way to get all 1000 results from the Search API?


Solution

  • There are two things happening here:

    1. You are receiving incomplete results because the query is timing out.
    2. You are being rate limited.

    The Search API has its own rate limits, which are separate from those of the rest of the REST API. From the GitHub documentation:

    The REST API for searching items has a custom rate limit that is separate from the rate limit governing the other REST API endpoints.
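
    To see how much of that separate budget is left, the rate limit endpoint (`GET /rate_limit`) reports each resource individually. The sketch below only parses such a response; the helper name `search_budget` is hypothetical, the sample numbers are made up, and I am assuming the payload has a `resources` object with a `search` entry containing `limit` and `remaining`.

```python
def search_budget(rate_limit_payload, resource="search"):
    """Return (remaining, limit) for one resource from a parsed /rate_limit body."""
    bucket = rate_limit_payload["resources"][resource]
    return bucket["remaining"], bucket["limit"]


# Made-up sample shaped like a /rate_limit response; the numbers are illustrative.
sample = {
    "resources": {
        "core": {"limit": 5000, "remaining": 4990, "reset": 1700000000},
        "search": {"limit": 30, "remaining": 2, "reset": 1700000000},
    }
}

remaining, limit = search_budget(sample)
print(f"{remaining} of {limit} search requests left in this window")
```

    In a real script the payload would come from something like `requests.get("https://api.github.com/rate_limit", headers=headers).json()`.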

    I would recommend requesting fewer results per page, so that each query is less likely to time out and return incomplete results.
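
    One way to notice the timeout case, as a rough sketch: search responses carry an `incomplete_results` boolean next to `total_count` and `items`. The helper name `page_is_complete` is hypothetical, and the two dictionaries are made-up stand-ins for `response.json()`.

```python
def page_is_complete(data):
    """Return True if a parsed search response reports a complete result set.

    `data` is assumed to be the dict from `response.json()`. A missing
    `incomplete_results` key is treated as incomplete, to stay on the safe side.
    """
    return data.get("incomplete_results", True) is False


# Made-up stand-ins for parsed responses, for illustration only.
timed_out = {"total_count": 1000, "incomplete_results": True, "items": [{}] * 15}
complete = {"total_count": 1000, "incomplete_results": False, "items": [{}] * 33}

print(page_is_complete(timed_out))  # False
print(page_is_complete(complete))   # True
```

    A page that fails this check could be retried, possibly with a smaller `per_page`, instead of being accepted as is.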

    You will also need to be very deliberate about the requests you make, because the limits are low. Getting the full 1000 results may be impossible without requesting a rate limit increase or implementing a very long backoff.

    I modified your code to add a primitive exponential backoff, but this still doesn't produce the full 1000 results and takes a while:

    import time

    import requests

    headers = {
        'Authorization': 'token <TOKEN>'
    }

    results = []
    for i in range(1, 31):
        url = "https://api.github.com/search/code?q=torch+in:file+language:python&per_page=33&page=" + str(i)
        backoff = 2  # initial backoff in seconds
        while backoff < 1024:
            time.sleep(backoff)
            try:
                response = requests.get(url, headers=headers)
                response.raise_for_status()  # raise an exception for HTTP 4xx and 5xx responses
                data = response.json()
                results.extend(data['items'])  # keep one flat list of result items
                print(f'Got {len(data["items"])} results for page {i}.')
                break
            except requests.exceptions.RequestException as e:
                print('ERROR: Failed to make request: ', e)
                backoff *= 2  # double the wait after each failure
        if backoff >= 1024:
            print('ERROR: Backoff limit reached.')
            break
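
    Instead of backing off blindly, the wait can also be taken from the response headers of a rate-limited request. This is only a sketch under assumptions: the helper name `seconds_to_wait` and its 60-second fallback are my own choices, and it relies on the `retry-after`, `x-ratelimit-remaining` and `x-ratelimit-reset` headers being present as GitHub documents them.

```python
import time


def seconds_to_wait(headers, now=None, default=60):
    """Suggest how long to sleep before retrying a rate-limited request.

    `headers` is a mapping of response headers with lowercase keys
    (`response.headers` from requests is case-insensitive, so it works too).
    """
    now = time.time() if now is None else now

    # Secondary rate limits may include Retry-After, given in seconds.
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        return max(int(retry_after), 0)

    # Primary limit exhausted: wait until the reset time (UTC epoch seconds).
    if headers.get("x-ratelimit-remaining") == "0":
        reset = int(headers.get("x-ratelimit-reset", now))
        return max(int(reset - now) + 1, 0)

    # No hint from the server: fall back to an assumed default.
    return default
```

    In the loop above, `time.sleep(seconds_to_wait(e.response.headers))` could replace the fixed doubling when `e.response` is not `None`.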