Tags: python, multithreading, python-multithreading

Python - Multithreading is running sequentially


I'm failing to see why this is running like a sequential process.

from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
import threading
import time
import random

pool = ThreadPoolExecutor(max_workers=3)
to_crawl = Queue()

#Import urls
for i in range(100):
    to_crawl.put(str(i))

def scraping(random_sleep):
    time.sleep(random_sleep)
    return

def post_scrape(url):
    print('URL %s finished' % url)

def my_crawler():
    while True:
        try:
            target_url = to_crawl.get()
            random_sleep = random.randint(1, 5)
            print("Current URL: %s, sleep: %s" % (format(target_url), random_sleep))
            executor = pool.submit(scraping(random_sleep))
            executor.add_done_callback(post_scrape(target_url))
        except Empty:
            return
        except Exception as e:
            print(e)
            continue

if __name__ == '__main__':
    my_crawler()

Expected output:

Current URL: 0, sleep: 5
Current URL: 1, sleep: 1
Current URL: 2, sleep: 2
URL 1 finished
URL 2 finished
URL 0 finished

Real output:

Current URL: 0, sleep: 5
URL 0 finished
Current URL: 1, sleep: 1
URL 1 finished
Current URL: 2, sleep: 2
URL 2 finished

Solution

  • The problem is with the way you are calling pool.submit:

    pool.submit(scraping(random_sleep))
    

    This says: call scraping(random_sleep) right here, in the calling thread, and submit its result (None) to the pool. The sleep therefore happens before submit even runs, which is exactly why everything executes sequentially; actually, I'm surprised it doesn't cause an error. What you want to do is submit the scraping function itself together with the argument random_sleep, which is achieved with this:

    pool.submit(scraping, random_sleep)
    
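    As a quick standalone illustration (this toy work function is mine, not from the original code), submitting the callable versus submitting its result behave very differently:

    from concurrent.futures import ThreadPoolExecutor
    import time

    def work(n):
        time.sleep(n)
        return n

    # Submitting the callable: submit() returns immediately and the sleeps overlap.
    with ThreadPoolExecutor(max_workers=3) as pool:
        start = time.time()
        futures = [pool.submit(work, 1) for _ in range(3)]
        for f in futures:
            f.result()
        print('pool.submit(work, 1) x3: %.2f s' % (time.time() - start))   # ~1 s

    # Submitting the result: work(1) is evaluated here, in the calling thread,
    # before submit() even runs, so the sleeps happen one after another.
    with ThreadPoolExecutor(max_workers=3) as pool:
        start = time.time()
        for _ in range(3):
            pool.submit(work(1))   # same mistake as in the question
        print('pool.submit(work(1)) x3: %.2f s' % (time.time() - start))   # ~3 s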

    Similarly, the next line should be:

    executor.add_done_callback(post_scrape)
    

    And the callback should be declared as:

    def post_scrape(executor):
    

    Here executor will be the Future itself, i.e. the object the original code stored in its executor variable. Note the callback receives only that Future, so there is no built-in way to hand it an extra user argument such as the url (short of wrapping the callback in a lambda or functools.partial); it is simpler to drop add_done_callback and do something like this instead:

    def scraping(random_sleep, url):
        time.sleep(random_sleep)
        print('URL %s finished' % url)
        return
    
    #...
    
    pool.submit(scraping, random_sleep, target_url)
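
    Putting the pieces together, a complete corrected version of the crawler might look like the sketch below. Two details go beyond the original code and are my own adjustments: get_nowait() replaces get() so the Empty branch can actually fire once the queue drains, and pool.shutdown(wait=True) is called at the end so the main thread waits for the submitted tasks to finish.

    from queue import Queue, Empty
    from concurrent.futures import ThreadPoolExecutor
    import time
    import random

    pool = ThreadPoolExecutor(max_workers=3)
    to_crawl = Queue()

    # Import urls
    for i in range(100):
        to_crawl.put(str(i))

    def scraping(random_sleep, url):
        # Runs inside a worker thread of the pool.
        time.sleep(random_sleep)
        print('URL %s finished' % url)

    def my_crawler():
        while True:
            try:
                # get_nowait() raises Empty once the queue is drained,
                # whereas the original blocking get() would wait forever.
                target_url = to_crawl.get_nowait()
                random_sleep = random.randint(1, 5)
                print("Current URL: %s, sleep: %s" % (target_url, random_sleep))
                # Submit the function and its arguments, not the result of calling it.
                pool.submit(scraping, random_sleep, target_url)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue

    if __name__ == '__main__':
        my_crawler()
        pool.shutdown(wait=True)   # wait for all outstanding scrapes to complete

    With this version the "Current URL" lines are printed almost immediately for the whole queue, and the "URL ... finished" lines arrive later, out of order, as the three workers complete their sleeps.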