python · scrapy

How can I extract links from webpages using scrapy?


I am trying to extract links from webpages whose URLs match a certain pattern. I tried using Scrapy with the following code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request


class MagazineCrawler(CrawlSpider):
    name = "MagazineCrawler"
    allowed_domains = ["eu-startups.com"]
    start_urls = ["https://www.eu-startups.com"]

    rules = (
        Rule(LinkExtractor(allow=["category/interviews"]), callback="parse_category"),
    )

    def parse_category(self, response):
        xpath_links = "//div[@class='td_block_inner tdb-block-inner td-fix-index']//a[@class='td-image-wrap ']/@href"
        subpage_links = response.xpath(xpath_links).extract()

        # Follow each subpage link and yield requests to crawl them
        for link in subpage_links:
            yield Request(link)

The problem is that it only extracts links from the first page matched by the pattern and then stops. If I remove the parse_category callback from the rule, it crawls normally through all the webpages whose URLs contain "category/interviews". Why is this happening?


Solution

  • This is happening because you need to set the follow parameter on your rule if you plan to use it together with a callback.

    From the scrapy docs for the Rule class:

    class scrapy.spiders.Rule(link_extractor=None, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None, errback=None)

    follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None follow defaults to True, otherwise it defaults to False.

    So if you want the spider to keep following links while also running a callback on each response, simply set follow=True in your spider's rule.

    For example:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.http import Request


    class MagazineCrawler(CrawlSpider):
        name = "MagazineCrawler"
        allowed_domains = ["eu-startups.com"]
        start_urls = ["https://www.eu-startups.com"]

        rules = (
            # follow=True keeps the spider crawling links from matched pages
            # even though a callback is set
            Rule(LinkExtractor(allow=["category/interviews"]),
                 callback="parse_category",
                 follow=True),
        )

        def parse_category(self, response):
            xpath_links = "//div[@class='td_block_inner tdb-block-inner td-fix-index']//a[@class='td-image-wrap ']/@href"
            subpage_links = response.xpath(xpath_links).extract()

            # Follow each subpage link and yield requests to crawl them
            for link in subpage_links:
                yield Request(link)
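
    If you want to try the spider without generating a full Scrapy project, you can also run it from a plain script via Scrapy's CrawlerProcess. This is a minimal sketch; the USER_AGENT value here is just an illustrative assumption, not something the fix requires:

    from scrapy.crawler import CrawlerProcess

    # Run the spider in-process; start() blocks until the crawl finishes
    process = CrawlerProcess(settings={"USER_AGENT": "Mozilla/5.0 (compatible; MagazineCrawler)"})
    process.crawl(MagazineCrawler)
    process.start()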