Code:
with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    for i in range(r):
        processes.append(executor.submit(scrape, i))
    for _ in concurrent.futures.as_completed(processes):
        offers += _.result()
        print('total:', len(offers))
The scrape function looks something like this:
def scrape(i):
    response = requests.get(f'somepage.com/page{i}')
    # use bs4 to get the offers from the response
    print(len(offers))
    return offers
I have this piece of code set up. The scrape function scrapes page i of a website and returns a list of links to offers; it also prints the length of that list, just for debugging purposes. When I run my code, it goes well for the first couple of pages, printing the running total from print('total:', len(offers)), but after that the 'total:' print stops appearing and only the print inside scrape shows up. The expected output would be something like:
total: 120
120
total: 240
120
total: 360
etc.
I'll gladly accept any help; it's my first time working with concurrency in Python, and also my first time using Stack Overflow to ask a question.
Maybe this will help you understand the threads. Each thread takes its own processing time and returns once it is complete, so you will not see the results in sequence like print(len(offers)) followed by print('total:', len(offers)).
Just to test this, suppose we remove requests and tweak the code as follows:
import concurrent.futures

r = 10
processes = []
offers = ""
with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    for i in range(r):
        processes.append(executor.submit(scrape, i))
    for _ in concurrent.futures.as_completed(processes):
        offers += _.result()
        #print('total:', len(offers))
        print('total:', offers)
print("*****")
and the scrape function becomes:
def scrape(i):
    print(f"scrape {i}")
    return f"scrape return {i}"
You will notice that print(f"scrape {i}") is executed very early in the processing, and only afterwards do you get the results from print('total:', offers).
In this type of setup, we wait for the threads to complete (the way you did with as_completed) and then arrange the results as expected.
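For example, if you want the per-page count and the running total to always appear in the order you expected, one option (a sketch, with a dummy scrape standing in for the real requests/bs4 one) is to do both prints in the main thread's as_completed loop, so the ordering no longer depends on when each worker thread happens to print:

import concurrent.futures

def scrape(i):
    # placeholder for the real function that fetches and parses page i;
    # here every page just pretends to return 120 offer links
    return [f"offer-{i}-{n}" for n in range(120)]

r = 10
offers = []
with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    futures = [executor.submit(scrape, i) for i in range(r)]
    for future in concurrent.futures.as_completed(futures):
        page_offers = future.result()
        print(len(page_offers))        # per-page count, printed by the main thread
        offers += page_offers
        print('total:', len(offers))   # running total, always right after the page count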