Tags: python, python-3.x, web-scraping, concurrent.futures

Unable to print results from a function while using concurrent.futures in some customized way


I've created a script using the concurrent.futures library to print the results from the fetch_links function. When I use a print statement inside the function, I get the results as expected. What I wish to do now is produce the results from that function using a yield statement instead.

Is there any way I can modify the code under main in order to print the results from fetch_links while keeping the function as is, i.e. keeping the yield statement?

import requests
from bs4 import BeautifulSoup
import concurrent.futures as cf

links = [
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50",
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50",
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=4&pagesize=50"
]

base = 'https://stackoverflow.com{}'

def fetch_links(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select(".summary .question-hyperlink"):
        # print(base.format(item.get("href")))
        yield base.format(item.get("href"))

if __name__ == '__main__':
    with requests.Session() as s:
        with cf.ThreadPoolExecutor(max_workers=5) as exe:
            future_to_url = {exe.submit(fetch_links,s,url): url for url in links}
            cf.as_completed(future_to_url)

Solution

  • Your fetch_links is a generator function, so each future's result is a generator object; you have to iterate over that as well to get the actual results:

    import requests
    from bs4 import BeautifulSoup
    import concurrent.futures as cf
    
    links = [
        "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50",
        "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50",
        "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=4&pagesize=50"
    ]
    
    base = 'https://stackoverflow.com{}'
    
    
    def fetch_links(s, link):
        r = s.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        for item in soup.select(".summary .question-hyperlink"):
            yield base.format(item.get("href"))
    
    
    if __name__ == '__main__':
        with requests.Session() as s:
            with cf.ThreadPoolExecutor(max_workers=5) as exe:
                future_to_url = {exe.submit(fetch_links, s, url): url for url in links}
                for future in cf.as_completed(future_to_url):
                    for result in future.result():
                        print(result)
    

    Output:

    https://stackoverflow.com/questions/64298886/rvest-webscraping-in-r-with-form-inputs
    https://stackoverflow.com/questions/64298879/is-this-site-not-suited-for-web-scraping-using-beautifulsoup
    https://stackoverflow.com/questions/64297907/python-3-extract-html-data-from-sports-site
    https://stackoverflow.com/questions/64297728/cant-get-the-fully-loaded-html-for-a-page-using-puppeteer
    https://stackoverflow.com/questions/64296859/scrape-text-from-a-span-tag-containing-nested-span-tag-in-beautifulsoup
    https://stackoverflow.com/questions/64296656/scrapy-nameerror-name-items-is-not-defined
    https://stackoverflow.com/questions/64296201/missing-values-while-scraping-using-beautifulsoup-in-python
    https://stackoverflow.com/questions/64296130/how-can-i-identify-the-element-containing-the-link-to-my-linkedin-profile-after
    https://stackoverflow.com/questions/64295959/why-use-scrapy-or-beautifulsoup-vs-just-parsing-html-with-regex-v2
    https://stackoverflow.com/questions/64295842/how-to-retreive-scrapping-data-from-web-to-json-like-format
    https://stackoverflow.com/questions/64295559/how-to-iterate-through-a-supermarket-website-and-getting-the-product-name-and-pr
    https://stackoverflow.com/questions/64295509/cant-stop-asyncio-request-for-some-delay
    https://stackoverflow.com/questions/64295244/paginate-with-network-requests-scraper
    and so on ...
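
    One caveat worth noting (not part of the original answer): because fetch_links is a generator function, exe.submit(fetch_links, s, url) only creates the generator object in the worker thread. The request and parsing run lazily in the main thread when future.result() is iterated, so the fetches are not actually concurrent. If you want the work itself to happen in the pool, drain the generator inside the worker. A minimal sketch, using threading.current_thread() as a hypothetical stand-in for the real network call to show where the work executes:

    ```python
    import concurrent.futures as cf
    import threading

    def fetch_links(link):
        # Hypothetical stand-in for the real s.get() + BeautifulSoup work:
        # a generator's body only runs when the generator is iterated.
        yield (link, threading.current_thread().name)

    def fetch_links_eager(link):
        # Drain the generator inside the worker so the work above actually
        # runs in the pool thread, not lazily in the main thread.
        return list(fetch_links(link))

    links = ["page2", "page3", "page4"]

    if __name__ == '__main__':
        with cf.ThreadPoolExecutor(max_workers=5) as exe:
            future_to_url = {exe.submit(fetch_links_eager, url): url for url in links}
            for future in cf.as_completed(future_to_url):
                for link, worker in future.result():
                    print(link, worker)  # worker is a pool thread, not MainThread
    ```

    With the original submit(fetch_links, s, url), the printed thread name would be MainThread; with the eager wrapper it is a ThreadPoolExecutor worker, confirming the scraping would run concurrently.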