Tags: python, multithreading, python-3.x, threadpool

Add multithreading or asynchronous requests to a web scrape


What is the best way to implement multithreading to make a web scrape faster? Would using Pool be a good solution, and if so, where in my code would I implement it?

import requests
from multiprocessing import Pool

with open('testing.txt', 'w') as outfile:
    results = []
    for number in (4, 8, 5, 7, 3, 10):
        response = requests.get('https://www.google.com/' + str(number))
        results.append(response.text)
        print(results)

    outfile.write("\n".join(results))

Solution

  • This can be moved to a pool easily. Python comes with both process-based and thread-based pools. Which to use is a tradeoff: processes are better at parallelizing CPU-bound code, but they are more expensive when passing results back to the main program. In your case, the code spends most of its time waiting on URLs and returns relatively large response bodies, so a thread pool makes sense. (Two alternative approaches are sketched after the code below.)

    I moved the code inside an if __name__ == "__main__" guard, as is needed on Windows machines.

    import requests
    from multiprocessing.pool import ThreadPool
    
    def worker(number):
        # fetch one URL and return the response body
        response = requests.get('https://www.google.com/' + str(number))
        return response.text
    
    # put some sort of cap on outstanding requests...
    MAX_URL_REQUESTS = 10
    
    if __name__ == "__main__":
        numbers = (4, 8, 5, 7, 3, 10)
        with ThreadPool(min(len(numbers), MAX_URL_REQUESTS)) as pool:
            with open('testing.txt', 'w') as outfile:
                for result in pool.map(worker, numbers, chunksize=1):
                    outfile.write(result)
                    outfile.write('\n')
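
    As an aside on the thread-versus-process tradeoff above, the same pattern can also be written with the standard-library concurrent.futures API. Below is a minimal sketch of that route, reusing the URLs and file name from the question; swapping ThreadPoolExecutor for ProcessPoolExecutor would then be the only change needed to move from threads to processes.

    import requests
    from concurrent.futures import ThreadPoolExecutor

    def worker(number):
        # fetch one URL and return the response body
        response = requests.get('https://www.google.com/' + str(number))
        return response.text

    # cap on outstanding requests, as in the ThreadPool version
    MAX_URL_REQUESTS = 10

    if __name__ == "__main__":
        numbers = (4, 8, 5, 7, 3, 10)
        with ThreadPoolExecutor(max_workers=min(len(numbers), MAX_URL_REQUESTS)) as executor:
            with open('testing.txt', 'w') as outfile:
                # executor.map preserves input order, just like pool.map
                for result in executor.map(worker, numbers):
                    outfile.write(result)
                    outfile.write('\n')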
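
    The question title also asks about an asynchronous approach. Here is a minimal sketch of that route using asyncio, assuming the third-party aiohttp package is installed (pip install aiohttp); the numbers and output file again mirror the original example.

    import asyncio
    import aiohttp

    async def fetch(session, number):
        # one request per number; error handling omitted for brevity
        async with session.get('https://www.google.com/' + str(number)) as response:
            return await response.text()

    async def main():
        numbers = (4, 8, 5, 7, 3, 10)
        async with aiohttp.ClientSession() as session:
            # gather runs the requests concurrently and preserves input order
            results = await asyncio.gather(*(fetch(session, n) for n in numbers))
        with open('testing.txt', 'w') as outfile:
            outfile.write('\n'.join(results))

    if __name__ == "__main__":
        asyncio.run(main())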