python, web-scraping, grequests

Scraping by sending concurrent requests with Python


I have Python 3.4 and I installed requests and a few other packages I need for web scraping. My problem is that I'd like to scrape about 7000 pages (just HTML/text), and I don't want to do it all at once; I'd like some kind of delay so I don't hit the server with too many requests and potentially get banned. I've heard of grequests, but apparently it isn't available for Python 3.4 (the actual error says it can't find vcvarsall.bat, and the documentation doesn't mention 3.4 support). Does anyone know of an alternative library that could manage the URL requests? In other words, I'm not looking to grab everything as fast as possible, but rather slow and steady.


Solution

  • I suggest rolling your own multithreaded program to do the requests. I found concurrent.futures to be the easiest way to multithread these kinds of requests, in particular using ThreadPoolExecutor. The documentation even includes a simple multithreaded URL-retrieval example.

    As for the second part of the question, it really depends on how much, and how, you want to limit your requests. For me, setting a sufficiently low max_workers argument, and possibly adding a time.sleep delay in my function, was enough to avoid any problems even when scraping tens of thousands of pages, but this obviously depends a lot on the site you're trying to scrape. It shouldn't be hard to implement some kind of batching or waiting, though (see the sketch after the code below).

    The following code is untested, but hopefully it can be a starting point. From here, you probably want to modify get_url_data (or whatever function you're using) with whatever else you need to do (e.g. parsing, saving).

    import concurrent.futures as futures
    import requests
    from requests.exceptions import RequestException

    urllist = ...

    def get_url_data(url, session):
        """Fetch one URL and return its text, or None if the request fails."""
        try:
            r = session.get(url, timeout=10)
            r.raise_for_status()
        except RequestException:
            # Covers HTTP error statuses as well as timeouts and connection errors.
            return None

        return r.text

    # One shared Session reuses connections across all the requests.
    s = requests.Session()

    try:
        # max_workers caps how many requests are in flight at once.
        with futures.ThreadPoolExecutor(max_workers=5) as ex:
            future_to_url = {ex.submit(get_url_data, url, s): url
                             for url in urllist}

            # Collect the page text (keyed by URL) as each request finishes.
            results = {future_to_url[future]: future.result()
                       for future in futures.as_completed(future_to_url)}
    finally:
        s.close()
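
    If you do want an explicit per-request delay on top of the max_workers cap, one option is to sleep inside the worker function. Here's a minimal, untested sketch; the one-second REQUEST_DELAY value is just an assumption you'd tune for the site you're scraping:

    import time
    from requests.exceptions import RequestException

    REQUEST_DELAY = 1.0  # assumed pause (seconds) between requests, per worker thread

    def get_url_data_throttled(url, session):
        """Like get_url_data above, but pauses after every request."""
        try:
            r = session.get(url, timeout=10)
            r.raise_for_status()
        except RequestException:
            return None
        finally:
            # Runs whether the request succeeded or failed, so each worker
            # makes at most ~1 request per REQUEST_DELAY seconds.
            time.sleep(REQUEST_DELAY)

        return r.text

    With max_workers=5 and a one-second delay, that works out to roughly five requests per second at most; lower max_workers or raise the delay to slow things down further.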