Tags: python, web-scraping, beautifulsoup, concurrent.futures

How can I return the data I'm scraping when using beautifulsoup and concurrent.futures?


I'm trying to asynchronously scrape some recipes from NYT Cooking, following this blog: https://beckernick.github.io/faster-web-scraping-python/

It prints the results without a problem, but for some reason my return does nothing here. I need to get that list back from the function. Any ideas?

import requests
from bs4 import BeautifulSoup
import concurrent.futures
import time

MAX_THREADS = 30
session = requests.Session()
urls = ['https://cooking.nytimes.com/search?q=&page={page_number}'.format(page_number=p) for p in range(1,5)]

# grab all of the recipe cards on each search page
def extract_recipe_urls(url):
    """returns a list of recipe urls"""
    recipe_cards = []
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    for rs in soup.find_all("article",{"class":"card recipe-card"}):
        recipe_cards.append(rs.find('a')['href'])
    
    print(recipe_cards)
    
    return recipe_cards

def async_scraping(scrape_function, urls):
    threads = min(MAX_THREADS, len(urls))
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        executor.map(scrape_function, urls)

Solution

  • You have to capture the value that executor.map() returns

     results = executor.map(...)
    

    and later you can iterate over it in a loop

    for item in results:
        print(item)
    

    or convert to list

    all_items = list(results)
    

    BTW: Because results is a generator, you can't consume it twice (e.g. in two for-loops, or in a for-loop and then in list()). If you need it more than once, first collect everything into a list with all_items = list(results) and then use all_items in both loops, as the sketch below illustrates.
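
    A rough sketch of that one-pass behaviour, using a trivial squaring function purely for illustration:

    import concurrent.futures

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        results = executor.map(lambda x: x * x, [1, 2, 3])

    print(list(results))  # [1, 4, 9]
    print(list(results))  # [] - the iterator is already exhausted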


    Minimal working code:

    import requests
    from bs4 import BeautifulSoup
    import concurrent.futures
    import time
    
    # --- constants ---
    
    MAX_THREADS = 30
    
    # --- functions ---   
    
    # grab all of the recipe cards on each search page
    def extract_recipe_urls(url):
        """returns a list of recipe urls"""
        
        session = requests.Session()
    
        recipe_cards = []
        response = session.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
    
        for rs in soup.find_all("article",{"class":"card recipe-card"}):
            recipe_cards.append(rs.find('a')['href'])
        
        return recipe_cards
    
    def async_scraping(scrape_function, urls):
        threads = min(MAX_THREADS, len(urls))
        
        with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
            results = executor.map(scrape_function, urls)
            
        return results
    
    # --- main ---
    
    urls = ['https://cooking.nytimes.com/search?q=&page={page_number}'.format(page_number=p) for p in range(1,5)]
            
    results = async_scraping(extract_recipe_urls, urls)
    
    #all_items = list(results)
    
    for item in results:
        print(item)
    

    BTW: Every call to extract_recipe_urls returns a list, so results is effectively a list of lists.

    all_items = list(results)
    print('len(all_items):', len(all_items))
          
    for item in all_items:
        print('len(item):', len(item))
    

    Results

    len(all_items): 4
    len(item): 48
    len(item): 48
    len(item): 48
    len(item): 48
    

    If you want all the items in one flat list, you can concatenate the sublists (list1.extend(list2) or list1 + list2); sum(all_items, []) does that concatenation in one step.

    all_items = sum(all_items, [])
    print('len(all_items):', len(all_items))
    

    Result:

    len(all_items): 192
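
    As an aside, itertools.chain.from_iterable() produces the same flat list and avoids the repeated copying that sum(..., []) does; a minimal sketch, applied to the list of lists obtained from list(results) (i.e. before the sum() call above):

    import itertools

    all_items = list(results)                                     # 4 sublists of 48 urls each
    flat_items = list(itertools.chain.from_iterable(all_items))
    print('len(flat_items):', len(flat_items))                    # 192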