Tags: python, python-3.x, web-scraping, concurrent.futures

Unable to boost the performance while parsing links from landing pages


I'm trying to add concurrency to the following script using concurrent.futures. The thing is, even with concurrent.futures in place, the performance stays the same: it has no visible effect on execution and fails to speed things up.

I know that if I create another function and pass it the links collected by get_titles(), so that the titles are scraped from the inner pages, I can make concurrent.futures work. However, I wish to get the titles from the landing pages using the function I've created below.

I used an iterative approach instead of recursion only because, with the latter, the function throws a RecursionError once more than 1000 calls are made (a sketch of that recursive variant follows the snippet below).

This is how I've tried so far (the site link that I've used within the script is a placeholder):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://stackoverflow.com'
link = 'https://stackoverflow.com/questions/tagged/web-scraping'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    while True:
        res = requests.get(link,headers=headers)
        soup = BeautifulSoup(res.text,"html.parser")
        for item in soup.select(".summary > h3"):
            post_title = item.select_one("a.question-hyperlink").get("href")
            print(urljoin(base,post_title))

        next_page = soup.select_one(".pager > a[rel='next']")

        if not next_page: return
        link = urljoin(base,next_page.get("href"))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles,url): url for url in [link]}
        futures.as_completed(future_to_url)
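
For reference, this is roughly what the recursive variant I avoided would look like (only a sketch, reusing the imports, headers and base from the snippet above; every follow-up page adds a stack frame, so a long pagination chain exceeds Python's default recursion limit of 1000):

def get_titles_recursive(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base, post_title))

    next_page = soup.select_one(".pager > a[rel='next']")
    if next_page:
        # each paginated page adds another call frame, which eventually
        # raises RecursionError on a deep pagination chain
        get_titles_recursive(urljoin(base, next_page.get("href")))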

QUESTION:

How can I improve the performance while parsing links from landing pages?

EDIT: I know I can achieve the same thing by following the route below, but that is not how my initial attempt is structured.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://stackoverflow.com'
links = ['https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(i) for i in range(1,5)]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base,post_title))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles,url): url for url in links}
        futures.as_completed(future_to_url)

Solution

  • Since your crawler is using threading, why not "spawn" more workers to process the follow URLs from the landing pages?

    For example:

    import concurrent.futures as futures
    from urllib.parse import urljoin
    
    import requests
    from bs4 import BeautifulSoup
    
    base = "https://stackoverflow.com"
    links = [
        f"{base}/questions/tagged/web-scraping?tab=newest&page={i}&pagesize=30"
        for i in range(1, 5)
    ]
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36",
    }
    
    
    def threader(function, target, workers=5):
        # fan `function` out over `target` on a thread pool; .result() waits for
        # each job and re-raises any exception raised inside a worker
        with futures.ThreadPoolExecutor(max_workers=workers) as executor:
            jobs = {executor.submit(function, item): item for item in target}
            for job in futures.as_completed(jobs):
                job.result()
    
    
    def make_soup(page_url: str) -> BeautifulSoup:
        # fetch and parse a page, reusing the browser-like headers defined above
        return BeautifulSoup(requests.get(page_url, headers=headers).text, "html.parser")
    
    
    def process_page(page: str):
        # the selected div holds the "Viewed N times" stat on a question page
        s = make_soup(page).find("div", class_="grid--cell ws-nowrap mb8")
        views = s.getText() if s is not None else "Missing data"
        print(f"{page}\n{' '.join(views.split())}")
    
    
    def make_pages(soup_of_pages: BeautifulSoup) -> list:
        # turn the relative question links on a listing page into absolute URLs
        return [
            urljoin(base, item.select_one("a.question-hyperlink").get("href"))
            for item in soup_of_pages.select(".summary > h3")
        ]
    
    
    def crawler(link):
        # walk one listing URL's pagination chain, handing each page's question
        # links to a second, short-lived pool of workers
        while True:
            soup = make_soup(link)
            threader(process_page, make_pages(soup), workers=10)
            next_page = soup.select_one(".pager > a[rel='next']")
            if not next_page:
                return
            link = urljoin(base, next_page.get("href"))
    
    
    if __name__ == '__main__':
        threader(crawler, links)
    

    Sample run output:

    https://stackoverflow.com/questions/66463025/exporting-several-scraped-tables-into-a-single-csv-file
    Viewed 19 times
    https://stackoverflow.com/questions/66464511/can-you-find-the-parent-of-the-soup-in-beautifulsoup
    Viewed 32 times
    https://stackoverflow.com/questions/66464583/r-subscript-out-of-bounds-for-reading-an-html-link
    Viewed 22 times
    
    and more ...
    

    Rationale:

    Essentially, what you're doing in your initial approach is spawning workers to fetch the question URLs from the search pages, but you submit only a single seed link, so one worker walks the whole pagination chain sequentially while the others sit idle. You also never process the follow URLs.

    What I've suggested is to spawn additional workers to process what the crawling workers collected.

    In your question you mention:

    I wish to get the titles from landing pages

    That's what this tuned version of your initial approach accomplishes by using the threader() function, which is basically a thin wrapper around ThreadPoolExecutor.
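
    If you would rather have the titles (or any other per-page value) returned instead of printed, the same wrapper idea can collect results. A minimal sketch, assuming the rest of the code above is unchanged (get_title and threader_collect are illustrative names, not part of the original answer):

    def get_title(page: str) -> str:
        # the <title> tag is the simplest stand-in for the question title
        title = make_soup(page).title
        return title.get_text(strip=True) if title else "Missing title"


    def threader_collect(function, target, workers=5) -> list:
        # like threader(), but gathers the workers' return values in completion order
        with futures.ThreadPoolExecutor(max_workers=workers) as executor:
            jobs = {executor.submit(function, item): item for item in target}
            return [job.result() for job in futures.as_completed(jobs)]

    Inside crawler() you could then swap the threader(process_page, ...) call for, e.g., titles = threader_collect(get_title, make_pages(soup), workers=10).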