I'm tearing my hair out trying to get Scrapy to look at all pages in a domain. I tried the rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)] way (instead of the dragon_start function in my code) and didn't get anywhere. Now I'm trying to extract all links and iterate over that list, and that's not working either. What am I failing to do? Copilot isn't helping, and I've looked at pretty much all the other SO posts...
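My rules-based attempt was essentially the same spider with a rules attribute in place of dragon_start, roughly:

# Rough sketch of the rules-based attempt mentioned above
# (same spider, rules instead of dragon_start)
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DvSpider(CrawlSpider):
    name = "dvspider"
    start_urls = ["https://dragonvale.fandom.com/wiki/Dragons"]
    allowed_domains = ["dragonvale.fandom.com/wiki"]
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        ...  # same parse_item as in the current version below

And this is the current version: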
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Request


class DvSpider(CrawlSpider):
    name = "dvspider"
    start_urls = ["https://dragonvale.fandom.com/wiki/Dragons"]
    allowed_domains = ["dragonvale.fandom.com/wiki"]

    def dragon_start(self, response):
        links = response.css('a::attr(href)').extract()
        for link in links:
            yield Request(response.urljoin(link), self.parse_item)

    def parse_item(self, response):
        if response.url[-7] == '_':
            dragon = response.css('table.dragonbox')
            if dragon:
                rows = dragon.xpath('//tr')
                yield {
                    'DragonName': rows[0].css('b::text').get().strip(),
                }
I don't get any errors, and Scrapy crawls the page in start_urls with a 200 response. But then the spider immediately says INFO: Closing spider (finished).
I think you are missing the start_requests entry point. For example:
from typing import Iterable

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Request


class DvSpider(CrawlSpider):
    name = "dvspider"
    start_urls = ["https://dragonvale.fandom.com/wiki/Dragons"]
    allowed_domains = ["dragonvale.fandom.com/wiki"]

    def start_requests(self) -> Iterable[Request]:
        # Explicit entry point: request the Dragons index page and send the
        # response to dragon_start instead of the default parse callback.
        yield Request(self.start_urls[0], self.dragon_start)

    def dragon_start(self, response):
        # Pull every link off the index page and schedule it for parse_item.
        links = response.css('a::attr(href)').extract()
        for link in links:
            yield Request(response.urljoin(link), self.parse_item)

    def parse_item(self, response):
        if response.url[-7] == '_':
            dragon = response.css('table.dragonbox')
            if dragon:
                rows = dragon.xpath('//tr')
                yield {
                    'DragonName': rows[0].css('b::text').get().strip(),
                }
This should call the dragon_start function and start iterating over the links from there.
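If it helps for testing, here is a minimal sketch of one way to run it and dump the items to JSON. It assumes DvSpider is defined (or imported) in the same script, and dragons.json is just an example filename; inside a regular Scrapy project, scrapy crawl dvspider -o dragons.json does the same thing.

# Minimal standalone runner (sketch): assumes DvSpider is available in this file.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {"dragons.json": {"format": "json"}},  # write scraped items to dragons.json
})
process.crawl(DvSpider)
process.start()  # blocks until the crawl finishes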