
Scrapy Crawl only first 5 pages of the site


I am working on the following problem: my boss wants me to create a CrawlSpider in Scrapy that scrapes article details such as the title and description, and paginates through only the first 5 pages.

I created a CrawlSpider, but it paginates through all the pages. How can I restrict the CrawlSpider to only the first (latest) 5 pages?

Listing page markup (this is also the kind of page that opens when we click the pagination "Next" link):

    <div class="list">
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-1">Article 1</a>
        </h2>
      </div>
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-2">Article 2</a>
        </h2>
      </div>
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-3">Article 3</a>
        </h2>
      </div>
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-4">Article 4</a>
        </h2>
      </div>
    </div>
    <ul class="pagination">
      <li class="next">
        <a href="https://www.example.com?page=2&keywords=&from=&topic=&year=&type="> Next </a>
      </li>
    </ul>

For this, I am using a Rule object with the restrict_xpaths argument to get all the article links, and as the callback I am executing the parse_item method, which reads the article title and description from the meta tags.

    Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "snippet-content")]/h2/a'),
         callback="parse_item", follow=True)

Detail page markup:

    <meta property="og:title" content="Article Title">
    <meta property="og:description" content="Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.">
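
These og: meta values can be read with plain XPath queries; for example, a quick check in a scrapy shell session on an article page (the URL below is just the example URL from the listing markup above, not a real page) looks roughly like this:

    # scrapy shell "https://example.com/article-1"
    >>> response.xpath('//meta[@property="og:title"]/@content').get()
    'Article Title'
    >>> response.xpath('//meta[@property="og:description"]/@content').get()
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. ..."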

After this, I have added another Rule object to handle pagination: the CrawlSpider follows the "Next" link, opens the next listing page, and repeats the same procedure again and again.

    Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]/li[@class="next"]/a'))

This is my CrawlSpider code:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    import w3lib.html


    class ExampleSpider(CrawlSpider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ["https://www.example.com/"]
        custom_settings = {
            'FEED_URI': 'articles.json',
            'FEED_FORMAT': 'json'
        }
        total = 0

        rules = (
            # Get the list of all articles on the page and follow these links
            Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "snippet-content")]/h2/a'),
                 callback="parse_item", follow=True),
            # After that, get the pagination next link's href and follow it; repeat the cycle
            Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]/li[@class="next"]/a'))
        )

        def parse_item(self, response):
            self.total = self.total + 1
            title = response.xpath('//meta[@property="og:title"]/@content').get() or ""
            description = w3lib.html.remove_tags(
                response.xpath('//meta[@property="og:description"]/@content').get() or "")

            return {
                'id': self.total,
                'title': title,
                'description': description
            }

Is there a way we can restrict the crawler to crawl only the first 5 pages?


Solution

  • Solution 1: use process_request.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    
    def limit_requests(request, response):
        # Option 1: read the page number straight from the URL
        # (works when the URL ends with the page number):
        # page_number = request.url[-1]
        # if int(page_number) >= 6:
        #     return None

        # Option 2: keep a counter on the function object itself
        if not hasattr(limit_requests, "page_number"):
            limit_requests.page_number = 0
        limit_requests.page_number += 1
    
        if limit_requests.page_number >= 5:
            return None
    
        return request
    
    
    class ExampleSpider(CrawlSpider):
        name = 'example_spider'
    
        start_urls = ['https://scrapingclub.com/exercise/list_basic/']
        page = 0
        rules = (
            # Get the list of all articles on the one page and follow these links
            Rule(LinkExtractor(restrict_xpaths='//div[@class="card-body"]/h4/a'), callback="parse_item",
                 follow=True),
            # After that get pagination next link get href and follow it, repeat the cycle
            Rule(LinkExtractor(restrict_xpaths='//li[@class="page-item"][last()]/a'), process_request=limit_requests)
        )
        total = 0
    
        def parse_item(self, response):
            title = response.xpath('//h3//text()').get(default='')
            price = response.xpath('//div[@class="card-body"]/h4//text()').get(default='')
            self.total = self.total + 1
    
            return {
                'id': self.total,
                'title': title,
                'price': price
            }
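
    Here the counter is stored as an attribute on the limit_requests function itself. If you would rather keep the state on the spider, Rule's process_request also accepts the name of a spider method as a string; a rough sketch of that variant (same selectors and start URL as above, pages_followed is just a made-up attribute name) would be:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class ExampleSpider(CrawlSpider):
        name = 'example_spider'
        start_urls = ['https://scrapingclub.com/exercise/list_basic/']
        pages_followed = 0  # hypothetical counter kept on the spider instance

        rules = (
            Rule(LinkExtractor(restrict_xpaths='//div[@class="card-body"]/h4/a'),
                 callback="parse_item", follow=True),
            # the string "limit_requests" is resolved to the spider method below
            Rule(LinkExtractor(restrict_xpaths='//li[@class="page-item"][last()]/a'),
                 process_request="limit_requests"),
        )

        def limit_requests(self, request, response):
            # drop the pagination request once enough pages have been followed
            self.pages_followed += 1
            if self.pages_followed >= 5:
                return None
            return request

        def parse_item(self, response):
            title = response.xpath('//h3//text()').get(default='')
            price = response.xpath('//div[@class="card-body"]/h4//text()').get(default='')
            return {'title': title, 'price': price}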
    

  • Solution 2: override the _requests_to_follow method (should be a bit slower, though).

    from scrapy.http import HtmlResponse
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    
    class ExampleSpider(CrawlSpider):
        name = 'example_spider'
    
        start_urls = ['https://scrapingclub.com/exercise/list_basic/']
    
        rules = (
            # Get the list of all articles on the one page and follow these links
            Rule(LinkExtractor(restrict_xpaths='//div[@class="card-body"]/h4/a'), callback="parse_item",
                 follow=True),
            # After that get pagination next link get href and follow it, repeat the cycle
            Rule(LinkExtractor(restrict_xpaths='//li[@class="page-item"][last()]/a'))
        )
        total = 0
        page = 0
        
        def _requests_to_follow(self, response):
            if not isinstance(response, HtmlResponse):
                return
            if self.page >= 5:  # stopping condition
                return
            seen = set()
            for rule_index, rule in enumerate(self._rules):
                links = [
                    lnk
                    for lnk in rule.link_extractor.extract_links(response)
                    if lnk not in seen
                ]
                for link in rule.process_links(links):
                    if rule_index == 1: # assuming there's only one "next button"
                        self.page += 1
                    seen.add(link)
                    request = self._build_request(rule_index, link)
                    yield rule.process_request(request, response)
    
        def parse_item(self, response):
            title = response.xpath('//h3//text()').get(default='')
            price = response.xpath('//div[@class="card-body"]/h4//text()').get(default='')
            self.total = self.total + 1
    
            return {
                'id': self.total,
                'title': title,
                'price': price
            }
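
    Either spider is run in the usual way (for example with scrapy crawl example_spider inside a project). If you prefer launching it from a plain Python script and writing the items to a JSON file, similar to the FEED settings in the question, a minimal sketch (assuming Scrapy 2.1+ for the FEEDS setting and that ExampleSpider is defined in the same file) looks like this:

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={
        # export the scraped items to a JSON file
        "FEEDS": {"items.json": {"format": "json"}},
    })
    process.crawl(ExampleSpider)
    process.start()  # blocks until the crawl is finished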
    

    The solutions are pretty much self-explanatory; if you want me to add something, please ask in the comments.