Tags: python, web-scraping, web, scrapy

Trouble scraping BBC with Python Scrapy (2023)


We want to scrape articles (content + headline) to extend our dataset for text classification purposes.

GOAL: scrape all articles from all pages at >> https://www.bbc.com/news/technology

PROBLEM: It seems that the code only extracts the articles from https://www.bbc.com/news/technology?page=1, even though we follow all pages. Could there be a problem in how we follow the pages?

import scrapy
from typing import Any
from scrapy.http import Response


class BBCSpider_2(scrapy.Spider):

    name = "bbc_tech"
    start_urls = ["https://www.bbc.com/news/technology"]


    def parse(self, response: Response, **kwargs: Any) -> Any:
        # Read the highest page number from the pagination bar, then request
        # every page and hand each one to parse_articles2.
        max_pages = response.xpath("//nav[@aria-label='Page']/div/div/div/div/ol/li[last()]/div/a//text()").get()
        max_pages = int(max_pages)
        for p in range(max_pages):
            page = f"https://www.bbc.com/news/technology?page={p+1}"
            yield response.follow(page, callback=self.parse_articles2)

Next, we are going into each article on the corresponding page:

    def parse_articles2(self, response):
        # Article links sit in two differently structured containers under
        # #main-content: the 4th div and the 8th div.
        container_to_scan = [4, 8]
        for box in container_to_scan:
            if box == 4:
                articles = response.xpath(f"//*[@id='main-content']/div[{box}]/div/div/ul/li")
            if box == 8:
                articles = response.xpath(f"//*[@id='main-content']/div[{box}]/div[2]/ol/li")
            for article_idx in range(len(articles)):
                if box == 4:
                    relative_url = response.xpath(f"//*[@id='main-content']/div[4]/div/div/ul/li[{article_idx+1}]/div/div/div/div[1]/div[1]/a/@href").get()
                elif box == 8:
                    relative_url = response.xpath(f"//*[@id='main-content']/div[8]/div[2]/ol/li[{article_idx+1}]/div/div/div[1]/div[1]/a/@href").get()
                else:
                    relative_url = None

                if relative_url is not None:
                    followup_url = "https://www.bbc.com" + relative_url
                    yield response.follow(followup_url, callback=self.parse_article)

Last but not least we are scraping the content and title of each article:

    def parse_article(response):
        article_text = response.xpath("//article/div[@data-component='text-block']")
        content = []
        for box in article_text:
            text = box.css("div p::text").get()
            if text is not None:
                content.append(text)

        title = response.css("h1::text").get()

        yield {
            "title": title,
            "content": content,
        }

When we run this we get an item_scraped_count of 24, but it should be roughly 24 x 29, give or take.


Solution

  • It appears that your subsequent calls to page 2, page 3, and so on are being filtered by Scrapy's duplicate-request filter. That happens because the site keeps serving the same front page no matter what page number you put into the URL query. After rendering the front page, the site uses a JSON API to fetch the actual article information for the requested page, and Scrapy alone cannot capture that content unless you call the API directly.
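
    You can confirm the filtering before changing anything else. A minimal sketch, assuming only the built-in DUPEFILTER_DEBUG setting added to your existing spider:

    class BBCSpider_2(scrapy.Spider):
        name = "bbc_tech"
        start_urls = ["https://www.bbc.com/news/technology"]
        # With DUPEFILTER_DEBUG enabled, scrapy logs every request the
        # duplicate filter drops instead of only the first one.
        custom_settings = {"DUPEFILTER_DEBUG": True}

    Each dropped page request then shows up in the log as a "Filtered duplicate request" line, and the final crawl stats include a dupefilter/filtered count.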

    The JSON API can be discovered in your browser's dev tools on the Network tab, and it is the approach used in the example below. You simply need to fill in the desired page number, just as you were already doing with the .../news/technology?page=? URL.
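
    If you want to poke at that endpoint before wiring it into the spider, scrapy shell is handy. A sketch with the query string trimmed to the parameters that matter for paging (whether the endpoint tolerates the trimmed form is an assumption; if not, paste the full URL from the network tab, as used in the spider below):

    scrapy shell "https://www.bbc.com/wc-data/container/topic-stream?pageNumber=2&pageSize=24&urn=urn%3Abbc%3Avivo%3Acuration%3Ab2790c4d-d5c4-489a-84dc-be0dcd3f5252"
    >>> data = response.json()
    >>> data.keys()                             # inspect the payload structure
    >>> [p["url"] for p in data["posts"][:3]]   # relative article links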

    One other thing: your parse_article method is missing self as its first parameter, which would throw an error and prevent you from actually scraping any of the page content. I also rewrote a couple of your XPaths to make them a bit more readable.

    import scrapy
    
    class BBCSpider_2(scrapy.Spider):
        name = "bbc_tech"
        start_urls = ["https://www.bbc.com/news/technology"]
    
        def parse(self, response):
            # The pagination bar still reports how many pages exist.
            max_pages = response.xpath("//nav[@aria-label='Page']//ol/li[last()]//text()").get()
            # Page 1 arrives as rendered html, so follow its article links directly.
            for article in response.xpath("//div[@type='article']"):
                if link := article.xpath(".//a[contains(@class, 'LinkPostLink')]/@href").get():
                    yield response.follow(link, callback=self.parse_article)
            # Pages 2 through max_pages only exist behind the json api; the +1
            # keeps the last page from being skipped.
            for i in range(2, int(max_pages) + 1):
                yield scrapy.Request(f"https://www.bbc.com/wc-data/container/topic-stream?adSlotType=mpu_middle&enableDotcomAds=true&isUk=false&lazyLoadImages=true&pageNumber={i}&pageSize=24&promoAttributionsToSuppress=%5B%22%2Fnews%22%2C%22%2Fnews%2Ffront_page%22%5D&showPagination=true&title=Latest%20News&tracking=%7B%22groupName%22%3A%22Latest%20News%22%2C%22groupType%22%3A%22topic%20stream%22%2C%22groupResourceId%22%3A%22urn%3Abbc%3Avivo%3Acuration%3Ab2790c4d-d5c4-489a-84dc-be0dcd3f5252%22%2C%22groupPosition%22%3A5%2C%22topicId%22%3A%22cd1qez2v2j2t%22%7D&urn=urn%3Abbc%3Avivo%3Acuration%3Ab2790c4d-d5c4-489a-84dc-be0dcd3f5252", callback=self.parse_json)
    
        def parse_json(self, response):
            # Each json page lists its articles under the "posts" key.
            for post in response.json()["posts"]:
                yield scrapy.Request(response.urljoin(post["url"]), callback=self.parse_article)
    
        def parse_article(self, response):
            article_text = response.xpath("//article/div[@data-component='text-block']//text()").getall()
            content = " ".join([i.strip() for i in article_text])
            title = response.css("h1::text").get()
            yield {
                "title": title,
                "content": content,
            }
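
    To try it out, save the spider to a file and run it without a full scrapy project (bbc_spider.py and the output name are placeholders):

    # -O overwrites the output file; the .jsonl extension selects the
    # JSON Lines feed exporter.
    scrapy runspider bbc_spider.py -O bbc_tech.jsonl

    With 29 pages of roughly 24 articles each, the closing stats' item_scraped_count should now land near the 24 x 29 figure from the question.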