web-scraping, scrapy, github-api

How do I set up automatic rotation of the GitHub token during parsing?


GitHub allows no more than 2500 requests per hour per token, and I have several accounts/tokens. How can I set up automatic token rotation in Scrapy, either when a certain number of requests has been reached (for example 2500) or when a request gets a 403 response?

import scrapy
from scrapy import Request


class GithubSpider(scrapy.Spider):
    name = 'github.com'
    start_urls = ['https://github.com']
    tokens = ['token1', 'token2', 'token3', 'token4']
    # At the moment every request goes out with the same hard-coded token
    headers = {
        'Accept': 'application/vnd.github.v3+json',
        'Authorization': 'token ' + tokens[1],
    }

    def start_requests(self, **cb_kwargs):
        # `languages` and `country` are defined elsewhere (not shown in the question)
        for lang in languages:
            cb_kwargs['lang'] = lang
            url = f'https://api.github.com/search/users?q=language:{lang}%20location:{country}&per_page=100'
            yield Request(url=url, headers=self.headers, callback=self.parse, cb_kwargs=cb_kwargs)

Solution

  • You could use the cycle function from the itertools module to build an endless generator over your list of tokens and take the next token from it for each request you send. That way all the tokens are used equally, which reduces the chance of any single token hitting its rate limit.
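
    As a quick illustration (with placeholder token strings), cycle simply starts over from the first element once the list is exhausted:

    from itertools import cycle

    tokens = cycle(['token1', 'token2', 'token3'])
    print(next(tokens))  # token1
    print(next(tokens))  # token2
    print(next(tokens))  # token3
    print(next(tokens))  # token1 -- the cycle starts over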

    If you start receiving 403 responses, you will know that all of the tokens have reached their limit. See the sample code below; a sketch at the end of this answer shows one way to switch tokens when a 403 comes back.

    from itertools import cycle

    import scrapy
    from scrapy import Request


    class GithubSpider(scrapy.Spider):
        name = 'github.com'
        start_urls = ['https://github.com']
        tokens = cycle(['token1', 'token2', 'token3', 'token4'])

        def start_requests(self, **cb_kwargs):
            # `languages` and `country` are assumed to be defined elsewhere,
            # as in the question
            for lang in languages:
                # Take the next token from the cycle for every request
                headers = {
                    'Accept': 'application/vnd.github.v3+json',
                    'Authorization': 'token ' + next(self.tokens),
                }
                cb_kwargs['lang'] = lang
                url = f'https://api.github.com/search/users?q=language:{lang}%20location:{country}&per_page=100'
                yield Request(url=url, headers=headers, callback=self.parse, cb_kwargs=cb_kwargs)
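
    For the second part of the question (switching tokens when a 403 comes back), one possible approach is to let 403 responses reach the callback and re-issue the same request with the next token. The sketch below is a modified version of the spider above, not a definitive implementation: the `languages`/`country` values and the `_auth_headers` helper are placeholders, and it assumes a 403 means the current token is rate-limited rather than invalid.

    from itertools import cycle

    import scrapy
    from scrapy import Request

    # Placeholder search parameters -- substitute your own values
    languages = ['python']
    country = 'germany'


    class GithubSpider(scrapy.Spider):
        name = 'github.com'
        start_urls = ['https://github.com']
        tokens = cycle(['token1', 'token2', 'token3', 'token4'])
        # Let 403 responses reach the callback instead of being dropped
        # by HttpErrorMiddleware
        handle_httpstatus_list = [403]

        def _auth_headers(self):
            # Build headers with the next token in the cycle
            return {
                'Accept': 'application/vnd.github.v3+json',
                'Authorization': 'token ' + next(self.tokens),
            }

        def start_requests(self):
            for lang in languages:
                url = f'https://api.github.com/search/users?q=language:{lang}%20location:{country}&per_page=100'
                yield Request(url=url, headers=self._auth_headers(),
                              callback=self.parse, cb_kwargs={'lang': lang})

        def parse(self, response, lang=None):
            if response.status == 403:
                # The current token looks rate-limited: retry the same URL
                # with a fresh token. dont_filter=True lets the repeated
                # request past Scrapy's duplicate filter.
                yield response.request.replace(headers=self._auth_headers(),
                                               dont_filter=True)
                return
            # ... handle the successful JSON response here ...

    The same token swap could also be done in a downloader middleware's process_response, which would keep the retry logic out of the spider.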