Tags: python, multithreading, python-requests, concurrent.futures

threading: function seems to run as a blocking loop although I am using threading


I am trying to speed up web scraping by running my HTTP requests in a ThreadPoolExecutor from the concurrent.futures library.

Here is the code:

import concurrent.futures
import requests
from bs4 import BeautifulSoup


urls = [
        'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=ibfxcfd&showcategories=CFD',
        'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=chix_ca',
        'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=tase',
        'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=chixen-be&showcategories=STK',
        'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=bvme&showcategories=STK'
        ]

def get_url(url):
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    a = soup.select_one('a')
    print(a)


with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    results = {executor.submit( get_url(url)) : url for url in urls}

    for future in concurrent.futures.as_completed(results):
        try:
            pass
        except Exception as exc:
            print('ERROR for symbol:', results[future])
            print(exc)

However, judging by how the script prints in the CLI, it seems that the requests are sent in a blocking loop.

Additionally, if I run the code with the plain loop below, I can see that it takes roughly the same time.

for u in urls:
    get_url(u)

I have had some success implementing concurrency with this library before, and I am at a loss as to what is going wrong here.

I am aware of the existence of the asyncio library as an alternative, but I would be keen on using threading instead.


Solution

  • You're not actually running your get_url calls as tasks; you call get_url in the main thread and pass its return value (None) to executor.submit. This is the concurrent.futures analog of the classic mistake with raw threading.Thread usage, where target=f() calls the function immediately instead of handing the callable to the thread (see the sketch after the corrected line below). Change:

    results = {executor.submit( get_url(url)) : url for url in urls}
    

    to:

    results = {executor.submit(get_url, url): url for url in urls}
    

    so that you pass the function and its arguments to the submit call (which then invokes the function in a worker thread for you), and your requests should actually run in parallel.
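
    For comparison, here is a minimal sketch of the same mistake and fix with raw threading.Thread, reusing the get_url and url names from the question:

    import threading

    # Bug: target=get_url(url) calls get_url right here, in the main thread,
    # and passes its return value (None) as the thread's target.
    t_wrong = threading.Thread(target=get_url(url))

    # Fix: pass the callable and its arguments separately; the worker thread
    # itself then calls get_url(url).
    t_right = threading.Thread(target=get_url, args=(url,))
    t_right.start()
    t_right.join()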
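
    Putting it together, here is a minimal sketch of the corrected script. Note one further tweak beyond the submit fix: the bare pass in your as_completed loop means future.result() is never called, so the except branch can never fire; calling result() re-raises any exception from the worker. In this sketch get_url also returns the tag instead of printing it, so the value comes back through the future:

    import concurrent.futures
    import requests
    from bs4 import BeautifulSoup

    urls = [
        'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=ibfxcfd&showcategories=CFD',
        'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=chix_ca',
        'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=tase',
        'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=chixen-be&showcategories=STK',
        'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=bvme&showcategories=STK',
    ]

    def get_url(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        return soup.select_one('a')  # first anchor tag on the page

    with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
        # submit(fn, *args) schedules fn(*args) to run on a worker thread
        futures = {executor.submit(get_url, url): url for url in urls}

        for future in concurrent.futures.as_completed(futures):
            try:
                print(future.result())  # re-raises any worker exception here
            except Exception as exc:
                print('ERROR for url:', futures[future])
                print(exc)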