Tags: python, python-3.x, web-scraping, concurrent.futures

Unable to boost the performance while parsing links from landing pages


I'm trying to add concurrency to the following script using concurrent.futures. The thing is, even with concurrent.futures in place, the performance stays the same: it has no visible effect on execution and fails to speed things up.

I know that if I create another function and pass it the links collected by get_titles(), so that the titles are scraped from the inner pages, I can make concurrent.futures work. However, I wish to get the titles from the landing pages using the function I've created below.

I used an iterative approach instead of recursion only because, with the latter, the function throws a RecursionError once more than 1000 calls are made (a sketch of that recursive variant follows the snippet below).

This is how I've tried so far (the site link that I've used within the script is a placeholder):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://stackoverflow.com'
link = 'https://stackoverflow.com/questions/tagged/web-scraping'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    while True:
        res = requests.get(link,headers=headers)
        soup = BeautifulSoup(res.text,"html.parser")
        for item in soup.select(".summary > h3"):
            post_title = item.select_one("a.question-hyperlink").get("href")
            print(urljoin(base,post_title))

        next_page = soup.select_one(".pager > a[rel='next']")

        if not next_page: return
        link = urljoin(base,next_page.get("href"))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles,url): url for url in [link]}
        futures.as_completed(future_to_url)
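
For reference, this is roughly what the recursive variant I avoided would look like (only a sketch, reusing the imports, headers and base from the snippet above; every follow-up page adds a stack frame, so a long pagination chain exceeds Python's default recursion limit of 1000):

def get_titles_recursive(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base, post_title))

    next_page = soup.select_one(".pager > a[rel='next']")
    if next_page:
        # each paginated page adds another call frame, which eventually
        # raises RecursionError on a deep pagination chain
        get_titles_recursive(urljoin(base, next_page.get("href")))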

QUESTION:

How can I improve the performance while parsing links from landing pages?

EDIT: I know I can achieve the same thing by following the route below, but that is not how my initial attempt is structured.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

base = 'https://stackoverflow.com'
links = ['https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(i) for i in range(1,5)]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

def get_titles(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"html.parser")
    for item in soup.select(".summary > h3"):
        post_title = item.select_one("a.question-hyperlink").get("href")
        print(urljoin(base,post_title))

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_titles,url): url for url in links}
        futures.as_completed(future_to_url)

Solution

  • Since your crawler is using threading, why not "spawn" more workers to process the follow URLs from the landing pages?

    For example:

    import concurrent.futures as futures
    from urllib.parse import urljoin
    
    import requests
    from bs4 import BeautifulSoup
    
    base = "https://stackoverflow.com"
    links = [
        f"{base}/questions/tagged/web-scraping?tab=newest&page={i}&pagesize=30"
        for i in range(1, 5)
    ]
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36",
    }
    
    
    def threader(function, target, workers=5):
        # fan `function` out over `target` on a thread pool; .result() waits for
        # each job and re-raises any exception raised inside a worker
        with futures.ThreadPoolExecutor(max_workers=workers) as executor:
            jobs = {executor.submit(function, item): item for item in target}
            for job in futures.as_completed(jobs):
                job.result()
    
    
    def make_soup(page_url: str) -> BeautifulSoup:
        # fetch and parse a page, reusing the browser-like headers defined above
        return BeautifulSoup(requests.get(page_url, headers=headers).text, "html.parser")
    
    
    def process_page(page: str):
        # the selected div holds the "Viewed N times" stat on a question page
        s = make_soup(page).find("div", class_="grid--cell ws-nowrap mb8")
        views = s.getText() if s is not None else "Missing data"
        print(f"{page}\n{' '.join(views.split())}")
    
    
    def make_pages(soup_of_pages: BeautifulSoup) -> list:
        # turn the relative question links on a listing page into absolute URLs
        return [
            urljoin(base, item.select_one("a.question-hyperlink").get("href"))
            for item in soup_of_pages.select(".summary > h3")
        ]
    
    
    def crawler(link):
        # walk one listing URL's pagination chain, handing each page's question
        # links to a second, short-lived pool of workers
        while True:
            soup = make_soup(link)
            threader(process_page, make_pages(soup), workers=10)
            next_page = soup.select_one(".pager > a[rel='next']")
            if not next_page:
                return
            link = urljoin(base, next_page.get("href"))
    
    
    if __name__ == '__main__':
        threader(crawler, links)
    

    Sample run output:

    https://stackoverflow.com/questions/66463025/exporting-several-scraped-tables-into-a-single-csv-file
    Viewed 19 times
    https://stackoverflow.com/questions/66464511/can-you-find-the-parent-of-the-soup-in-beautifulsoup
    Viewed 32 times
    https://stackoverflow.com/questions/66464583/r-subscript-out-of-bounds-for-reading-an-html-link
    Viewed 22 times
    
    and more ...
    

    Rationale:

    Essentially, what you're doing in your initial approach is spawning workers to fetch the question URLs from the search pages, but you submit only a single seed link, so one worker walks the whole pagination chain sequentially while the others sit idle. You also never process the follow URLs.

    What I've suggested is to spawn additional workers to process what the crawling workers collected.

    In your question you mention:

    I wish to get the titles from landing pages

    That's what this tuned version of your initial approach accomplishes by using the threader() function, which is basically a thin wrapper around ThreadPoolExecutor.
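
    If you would rather have the titles (or any other per-page value) returned instead of printed, the same wrapper idea can collect results. A minimal sketch, assuming the rest of the code above is unchanged (get_title and threader_collect are illustrative names, not part of the original answer):

    def get_title(page: str) -> str:
        # the <title> tag is the simplest stand-in for the question title
        title = make_soup(page).title
        return title.get_text(strip=True) if title else "Missing title"


    def threader_collect(function, target, workers=5) -> list:
        # like threader(), but gathers the workers' return values in completion order
        with futures.ThreadPoolExecutor(max_workers=workers) as executor:
            jobs = {executor.submit(function, item): item for item in target}
            return [job.result() for job in futures.as_completed(jobs)]

    Inside crawler() you could then swap the threader(process_page, ...) call for, e.g., titles = threader_collect(get_title, make_pages(soup), workers=10).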