I'm failing to see why this is running as a sequential process.
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
import threading
import time
import random

pool = ThreadPoolExecutor(max_workers=3)
to_crawl = Queue()

# Import urls
for i in range(100):
    to_crawl.put(str(i))

def scraping(random_sleep):
    time.sleep(random_sleep)
    return

def post_scrape(url):
    print('URL %s finished' % url)

def my_crawler():
    while True:
        try:
            target_url = to_crawl.get()
            random_sleep = random.randint(1, 5)
            print("Current URL: %s, sleep: %s" % (format(target_url), random_sleep))
            executor = pool.submit(scraping(random_sleep))
            executor.add_done_callback(post_scrape(target_url))
        except Empty:
            return
        except Exception as e:
            print(e)
            continue

if __name__ == '__main__':
    my_crawler()
Expected output:
Current URL: 0, sleep: 5
Current URL: 1, sleep: 1
Current URL: 2, sleep: 2
URL 1 finished
URL 2 finished
URL 0 finished
Actual output:
Current URL: 0, sleep: 5
URL 0 finished
Current URL: 1, sleep: 1
URL 1 finished
Current URL: 2, sleep: 2
URL 2 finished
The problem is with the way you are calling pool.submit:

pool.submit(scraping(random_sleep))

This says: call scraping(random_sleep) right here, in the calling thread, and submit its result to the pool. That is exactly why the output is sequential: the sleep happens inline, before submit() is ever reached. It also explains why there is no visible error: scraping returns None, and when a worker later tries to call None, the resulting TypeError is stored on the Future and never retrieved. What you want is to submit the scraping function itself with the argument random_sleep, which is achieved with this:
pool.submit(scraping, random_sleep)
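To see the difference in isolation, here is a minimal, self-contained sketch (slow_task and the timings are illustrative, not from the question). Submitting f(x) runs f in the caller and blocks; submitting f, x hands the callable to the pool and returns a Future immediately:

from concurrent.futures import ThreadPoolExecutor
import time

def slow_task(seconds):
    time.sleep(seconds)
    return seconds

pool = ThreadPoolExecutor(max_workers=3)

t0 = time.monotonic()
# Wrong: slow_task(1) runs *now*, in this thread; only its return value
# reaches the pool, so the sleep has already happened serially.
pool.submit(slow_task(1))
print('wrong form: submit returned after %.2fs' % (time.monotonic() - t0))  # ~1.00s

t0 = time.monotonic()
# Right: the callable and its argument are passed separately; a worker
# thread does the sleep while submit() returns at once.
future = pool.submit(slow_task, 1)
print('right form: submit returned after %.2fs' % (time.monotonic() - t0))  # ~0.00s
print('result: %s' % future.result())  # blocks only here, waiting on the worker

pool.shutdown(wait=True)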
Similarly, the next line should be:
executor.add_done_callback(post_scrape)
And the callback should be declared as:

def post_scrape(executor):

where executor will be the Future itself, i.e. the value your code stores in the (misleadingly named) executor variable. Note that add_done_callback passes only the Future, so there is no direct way to attach a user argument to this callback; the simplest fix is to fold the print into scraping and drop the add_done_callback entirely (a partial-based workaround is also sketched after the code below):
def scraping(random_sleep, url):
    time.sleep(random_sleep)
    print('URL %s finished' % url)
    return

# ...
pool.submit(scraping, random_sleep, target_url)
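If you would rather keep the done-callback style, one workaround (my addition, not part of the original answer) is to pre-bind the URL with functools.partial, since the callback always receives exactly one argument, the Future; this keeps the original one-argument scraping:

import functools

def post_scrape(url, future):
    # 'future' is supplied by add_done_callback; 'url' was bound in advance.
    print('URL %s finished' % url)

future = pool.submit(scraping, random_sleep)
future.add_done_callback(functools.partial(post_scrape, target_url))

And for completeness, here is a sketch of the whole crawler with the fix applied. Two details are my assumptions rather than part of the original code: it uses get_nowait() so the Empty branch can actually fire (a blocking get() on an empty queue waits forever instead of raising Empty), and it shuts the pool down at the end so the process waits for outstanding work:

from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
import time
import random

pool = ThreadPoolExecutor(max_workers=3)
to_crawl = Queue()

for i in range(10):  # 10 instead of 100 to keep the demo short
    to_crawl.put(str(i))

def scraping(random_sleep, url):
    # Runs in a worker thread; the sleep stands in for network I/O.
    time.sleep(random_sleep)
    print('URL %s finished' % url)

def my_crawler():
    while True:
        try:
            # get_nowait() raises Empty as soon as the queue is drained.
            target_url = to_crawl.get_nowait()
        except Empty:
            return
        random_sleep = random.randint(1, 5)
        print("Current URL: %s, sleep: %s" % (target_url, random_sleep))
        # Function and arguments passed separately: submit() returns
        # immediately and up to three URLs are scraped concurrently.
        pool.submit(scraping, random_sleep, target_url)

if __name__ == '__main__':
    my_crawler()
    pool.shutdown(wait=True)  # wait for the remaining futures to finish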