Tags: python, web-scraping, scrapy, web-crawler

Example of a Scrapy script that traverses all URLs on a given site and yields the URL of each page as a variable


Is there a simple example of a Scrapy spider that can be called from a Python script, visits every URL on a site, and gives back the URL of each page visited?

This is what I have so far, but this definitely doesn't work. It won't even run.

from scrapy.spiders import CrawlSpider
from twisted.internet import process


class MySpider(CrawlSpider):
    name = 'toscrape'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        do_something(response.url)

    def do_something(self, url):
        # pass do something here
        pass

    
process.crawl(MySpider)
process.start()

Solution

  • You are actually not that far off.

    Really, the only changes you need to make are to use CrawlerProcess from scrapy.crawler instead of the Twisted import, and then either handle the site's pagination or iterate a list of page URLs directly. The former is the better option.

    Like this:

    import scrapy
    from scrapy.crawler import CrawlerProcess


    class MySpider(scrapy.Spider):
        name = 'toscrape'
        allowed_domains = ['toscrape.com']
        start_urls = ['http://books.toscrape.com']

        def parse(self, response):
            # Yield the URL of the page currently being visited.
            yield {"url": response.url}
            # Follow the "next" pagination link, if there is one.
            next_link = response.xpath('//li[@class="next"]/a/@href').get()
            if next_link:
                yield scrapy.Request(response.urljoin(next_link))


    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()
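
    Since the question also asks for the URL of each page to be available back in the calling Python script (not just in Scrapy's log or a feed export), one way is to connect a handler to Scrapy's item_scraped signal before starting the crawl. A minimal sketch of that idea follows; collected_urls and collect_url are illustrative names, and MySpider is the spider class defined above:

    from scrapy import signals
    from scrapy.crawler import CrawlerProcess


    collected_urls = []  # illustrative container for the results

    def collect_url(item, response, spider):
        # Called once for every item the spider yields.
        collected_urls.append(item["url"])


    process = CrawlerProcess()
    # Create the crawler explicitly so the signal handler can be attached.
    crawler = process.create_crawler(MySpider)
    crawler.signals.connect(collect_url, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()

    print(collected_urls)

    Each time the spider yields an item, Scrapy fires item_scraped and the handler appends that page's URL to the list, so collected_urls holds every visited URL once the crawl finishes.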