
Scrapy scraping all pages in domain


I'm tearing my hair out trying to get Scrapy to look at all pages in a domain. I tried the rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)] way (instead of the dragon_start function in my code) and didn't get anywhere. Now I'm trying to extract all links and iterate over that list, and that's not working either. What am I failing to do? Copilot isn't helping, and I've looked at pretty much all the other SO posts...

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Request

class DvSpider(CrawlSpider):
    name = "dvspider"
    start_urls = ["https://dragonvale.fandom.com/wiki/Dragons"]
    allowed_domains = ["dragonvale.fandom.com/wiki"]

    def dragon_start(self, response):
        links = response.css('a::attr(href)').extract()
        for link in links:
            yield Request(response.urljoin(link), self.parse_item)

    def parse_item(self, response):
        if response.url[-7] == '_':
            dragon = response.css('table.dragonbox')

            if dragon:
                rows = dragon.xpath('//tr')
                yield {
                    'DragonName': rows[0].css('b::text').get().strip(),
                }

I don't get any errors, and Scrapy crawls start_urls with a 200 response. But then the spider immediately says INFO: Closing spider (finished).


Solution

  • I think you are missing the start_requests entry point. As written, nothing ever routes a response to dragon_start: a CrawlSpider with no rules fetches start_urls with its default callback, finds no rules to follow, and finishes.

    For example,

    from typing import Iterable

    from scrapy import Request
    from scrapy.spiders import CrawlSpider

    class DvSpider(CrawlSpider):
        name = "dvspider"
        start_urls = ["https://dragonvale.fandom.com/wiki/Dragons"]
        # allowed_domains must hold bare domain names; an entry with a
        # path like "dragonvale.fandom.com/wiki" never matches a request's
        # hostname, so every followed link gets dropped as offsite.
        allowed_domains = ["dragonvale.fandom.com"]

        def start_requests(self) -> Iterable[Request]:
            # Entry point: send the first request to dragon_start instead
            # of relying on CrawlSpider's default rule-based crawling.
            yield Request(self.start_urls[0], self.dragon_start)

        def dragon_start(self, response):
            # Collect every link on the listing page and queue a request
            # for each one.
            links = response.css('a::attr(href)').extract()
            for link in links:
                yield Request(response.urljoin(link), self.parse_item)

        def parse_item(self, response):
            # Individual dragon pages end in "_Dragon", so the seventh
            # character from the end of the URL is an underscore.
            if response.url[-7] == '_':
                dragon = response.css('table.dragonbox')

                if dragon:
                    # Use a relative XPath (.//tr); '//tr' would search the
                    # whole document, not just the dragonbox table.
                    rows = dragon.xpath('.//tr')

                    yield {
                        'DragonName': rows[0].css('b::text').get().strip(),
                    }

    This should call the dragon_start function and start iterating over the links from there. Note the allowed_domains change as well: Scrapy matches allowed_domains entries against hostnames, so your original entry with a path (dragonvale.fandom.com/wiki) never matches, and every request dragon_start yields is filtered as an offsite request, which is another way to end up with an immediate "Closing spider (finished)".
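
    For what it's worth, the rules = [Rule(LinkExtractor(), ...)] approach you tried first should also work once allowed_domains holds a bare domain. Here is a minimal sketch reusing your _Dragon URL heuristic; the spider name is arbitrary and I haven't run it against the live wiki:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class DvRulesSpider(CrawlSpider):
        name = "dvrulesspider"
        start_urls = ["https://dragonvale.fandom.com/wiki/Dragons"]
        allowed_domains = ["dragonvale.fandom.com"]

        # Follow every in-domain link; parse_item decides what to keep.
        rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

        def parse_item(self, response):
            # Same "_Dragon" suffix check as above.
            if response.url[-7] == '_':
                dragon = response.css('table.dragonbox')
                if dragon:
                    rows = dragon.xpath('.//tr')
                    yield {
                        'DragonName': rows[0].css('b::text').get().strip(),
                    }

    Either version can be run without a full project via scrapy runspider dvspider.py -o dragons.json.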