Tags: python, multithreading, python-3.x, threadpool

Add multithreading or asynchronous requests to a web scrape


What is the best way to implement multithreading to make a web scrape faster? Would using Pool be a good solution, and if so, where in my code would I implement it?

import requests
from multiprocessing import Pool

with open('testing.txt', 'w') as outfile:
    results = []
    for number in (4, 8, 5, 7, 3, 10):
        response = requests.get('https://www.google.com/' + str(number))
        results.append(response.text)
        print(results)

    outfile.write("\n".join(results))

Solution

  • This can be moved to a pool easily. Python comes with both process-based and thread-based pools. Which to use is a tradeoff: processes are better at parallelizing CPU-bound code, but they are more expensive when passing results back to the main program. In your case, the code spends most of its time waiting on URLs and returns relatively large response bodies, so a thread pool makes sense. (Two alternative approaches are sketched after the code below.)

    I moved the code inside an if __name__ == "__main__" guard, as is needed on Windows machines.

    import requests
    from multiprocessing.pool import ThreadPool
    
    def worker(number):
        # fetch one URL and return the response body
        response = requests.get('https://www.google.com/' + str(number))
        return response.text
    
    # put some sort of cap on outstanding requests...
    MAX_URL_REQUESTS = 10
    
    if __name__ == "__main__":
        numbers = (4, 8, 5, 7, 3, 10)
        with ThreadPool(min(len(numbers), MAX_URL_REQUESTS)) as pool:
            with open('testing.txt', 'w') as outfile:
                for result in pool.map(worker, numbers, chunksize=1):
                    outfile.write(result)
                    outfile.write('\n')
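
    As an aside on the thread-versus-process tradeoff above, the same pattern can also be written with the standard-library concurrent.futures API. Below is a minimal sketch of that route, reusing the URLs and file name from the question; swapping ThreadPoolExecutor for ProcessPoolExecutor would then be the only change needed to move from threads to processes.

    import requests
    from concurrent.futures import ThreadPoolExecutor

    def worker(number):
        # fetch one URL and return the response body
        response = requests.get('https://www.google.com/' + str(number))
        return response.text

    # cap on outstanding requests, as in the ThreadPool version
    MAX_URL_REQUESTS = 10

    if __name__ == "__main__":
        numbers = (4, 8, 5, 7, 3, 10)
        with ThreadPoolExecutor(max_workers=min(len(numbers), MAX_URL_REQUESTS)) as executor:
            with open('testing.txt', 'w') as outfile:
                # executor.map preserves input order, just like pool.map
                for result in executor.map(worker, numbers):
                    outfile.write(result)
                    outfile.write('\n')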
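
    The question title also asks about an asynchronous approach. Here is a minimal sketch of that route using asyncio, assuming the third-party aiohttp package is installed (pip install aiohttp); the numbers and output file again mirror the original example.

    import asyncio
    import aiohttp

    async def fetch(session, number):
        # one request per number; error handling omitted for brevity
        async with session.get('https://www.google.com/' + str(number)) as response:
            return await response.text()

    async def main():
        numbers = (4, 8, 5, 7, 3, 10)
        async with aiohttp.ClientSession() as session:
            # gather runs the requests concurrently and preserves input order
            results = await asyncio.gather(*(fetch(session, n) for n in numbers))
        with open('testing.txt', 'w') as outfile:
            outfile.write('\n'.join(results))

    if __name__ == "__main__":
        asyncio.run(main())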