What is the best way to implement multithreading to make a web scraper faster? Would using Pool be a good solution, and if so, where in my code would I implement it?
import requests
from multiprocessing import Pool
with open('testing.txt', 'w') as outfile:
    results = []
    for number in (4, 8, 5, 7, 3, 10):
        response = requests.get('https://www.google.com/' + str(number))
        results.append(response.text)
    print(results)
    outfile.write("\n".join(results))
This can be moved to a pool easily. Python ships with both process-based and thread-based pools, and which to use is a tradeoff. Processes are better at parallelizing CPU-bound code, but passing results back to the main process is more expensive. In your case the code spends most of its time waiting on URL responses and returns relatively large objects, so a thread pool makes sense.
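To see why threads help here, consider this minimal sketch (using `time.sleep` as a stand-in for a slow network call, not your actual scraper): the pool overlaps the waiting, so four 0.2-second waits finish in roughly 0.2 seconds instead of 0.8.

```python
import time
from multiprocessing.pool import ThreadPool

def slow_task(n):
    # stand-in for a blocking network request
    time.sleep(0.2)
    return n * n

start = time.time()
with ThreadPool(4) as pool:
    # map blocks until all workers finish, returning results in input order
    results = pool.map(slow_task, range(4))
elapsed = time.time() - start

print(results)        # [0, 1, 4, 9]
print(elapsed < 0.7)  # the sleeps ran concurrently, not back to back
```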
I also moved the code inside an `if __name__ == "__main__":` guard, which is required on Windows machines when using multiprocessing.
import requests
from multiprocessing.pool import ThreadPool

def worker(number):
    # each worker thread fetches one URL and returns its body
    response = requests.get('https://www.google.com/' + str(number))
    return response.text

# put some sort of cap on outstanding requests...
MAX_URL_REQUESTS = 10

if __name__ == "__main__":
    numbers = (4, 8, 5, 7, 3, 10)
    with ThreadPool(min(len(numbers), MAX_URL_REQUESTS)) as pool:
        with open('testing.txt', 'w') as outfile:
            # map yields results in the same order as the inputs
            for result in pool.map(worker, numbers, chunksize=1):
                outfile.write(result)
                outfile.write('\n')
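If you would rather avoid the multiprocessing module, `concurrent.futures.ThreadPoolExecutor` offers an equivalent thread pool with the same `map`-style API; your worker function plugs in unchanged. The sketch below stubs out the network call (returning the number as a string) so the structure is visible on its own; in your script the body of `worker` would be the `requests.get(...)` call from above.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(number):
    # in the real script this would be:
    # return requests.get('https://www.google.com/' + str(number)).text
    return str(number)

numbers = (4, 8, 5, 7, 3, 10)
with ThreadPoolExecutor(max_workers=len(numbers)) as executor:
    # executor.map, like pool.map, preserves the input order
    results = list(executor.map(worker, numbers))

print(results)  # ['4', '8', '5', '7', '3', '10']
```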