Tags: python, python-3.x, web-scraping, concurrent.futures

Can't use ThreadPoolExecutor in the right way when a function produces a single link


I've created a script that uses the concurrent.futures library for multithreading in order to execute faster. The current implementation would work if the first function within the script, get_content_url(), produced multiple links. However, as that function produces a single link, I don't understand how to use concurrent.futures in such a case.

To clarify what the first function is doing: when I supply ids from a csv file to get_content_url(), it generates a single link using the token collected from the JSON response.

How can I apply concurrent.futures within the script in the right way to make the execution faster?

I've tried with:

import csv
import requests
import concurrent.futures
from bs4 import BeautifulSoup

base_link = "https://www.some_website.com/{}"
target_link = "https://www.some_website.com/{}"

def get_content_url(item_id):
    r = requests.get(base_link.format(item_id['id']))
    token = r.json()['token']
    content_url = target_link.format(token)
    yield content_url

def get_content(target_link):
    r = requests.get(target_link)
    soup = BeautifulSoup(r.text,"html.parser")
    try:
        title = soup.select_one("h1#maintitle").get_text(strip=True)
    except Exception:
        title = ""
    print(title)

if __name__ == '__main__':
    with open("IDS.csv","r") as f:
        reader = csv.DictReader(f)
        with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
            for _id in reader:
                future_to_url = {executor.submit(get_content,item): item for item in get_content_url(_id)}
                concurrent.futures.as_completed(future_to_url)

Solution

  • This might be a bit hard to reproduce, since I don't know what's inside IDS.csv and a valid URL is missing from your question, but here's something to play with:

    import csv
    import random
    
    import requests
    import concurrent.futures
    from bs4 import BeautifulSoup
    
    base_link = "https://www.some_website.com/{}"
    target_link = "https://www.some_website.com/{}"
    
    
    def get_content_url(item_id):
        url = base_link.format(item_id)
        print(f"Requesting {url}...")
        token = requests.get(url).json()['token']
        return target_link.format(token)
    
    
    def get_content(item_id):
        url = get_content_url(item_id)
        print(f"Fetching {url}...")
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "html.parser")
        try:
            title = soup.select_one("h1#maintitle").get_text(strip=True)
            return title
        except Exception as exc:
            # Return the exception instead of raising so one bad id
            # doesn't kill the as_completed() loop below
            return exc
    
    
    def write_fake_ids():
        fake_ids = [
            {"item": "sample_item", "item_id": _} for _ in 
            random.sample(range(1000, 10001), 1000)
        ]
        with open("IDS.csv", "w") as output:
            w = csv.DictWriter(output, fieldnames=fake_ids[0].keys())
            w.writeheader()
            w.writerows(fake_ids)
    
    
    def get_ids():
        with open("IDS.csv") as csv_file:
            ids = csv.DictReader(csv_file)
            yield from ids
    
    
    if __name__ == '__main__':
        with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
            future_to_url = {
                executor.submit(get_content, id_['item_id']): id_ for id_ in get_ids()
            }
            for future in concurrent.futures.as_completed(future_to_url):
                print(future.result())
    

    I'm mocking the .csv file with write_fake_ids(). You can ignore it or remove it; it doesn't get called anywhere in the code.
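
    Since each csv row maps to exactly one result here, executor.map() is an even simpler fit than the dict-of-futures pattern: it submits one task per id and yields results in input order. Below is a minimal sketch of that variant; it assumes the same item_id column and placeholder URLs as above, so treat it as something to adapt rather than run as-is:

    import csv
    import concurrent.futures

    import requests
    from bs4 import BeautifulSoup

    base_link = "https://www.some_website.com/{}"
    target_link = "https://www.some_website.com/{}"


    def get_content(item_id):
        # Resolve the token for this id, then fetch and parse the final page
        token = requests.get(base_link.format(item_id)).json()['token']
        soup = BeautifulSoup(requests.get(target_link.format(token)).text, "html.parser")
        title = soup.select_one("h1#maintitle")
        return title.get_text(strip=True) if title else ""


    if __name__ == '__main__':
        with open("IDS.csv") as csv_file:
            ids = [row['item_id'] for row in csv.DictReader(csv_file)]
        with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
            # map() handles submit() and result collection in one call
            for title in executor.map(get_content, ids):
                print(title)

    One difference to keep in mind: iterating over executor.map() re-raises any exception from a worker, so the None-check on select_one() above stands in for the try/except used earlier.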