Is there a simple example of a Scrapy spider that can be run from a Python script, visits each URL on a site, and reports the URL of each page it visited?
This is what I have so far, but this definitely doesn't work. It won't even run.
from scrapy.spiders import CrawlSpider
from twisted.internet import process

class MySpider(CrawlSpider):
    name = 'toscrape'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        do_something(response.url)

    def do_something(self, url):
        # do something here
        pass

process.crawl(MySpider)
process.start()
You are actually not that far off.
Really, the only changes you need to make are using Scrapy's CrawlerProcess instead of the Twisted reactor, and then either handling the site's pagination in the spider or iterating over a list of page URLs directly. Handling the pagination is the better option.
Like this:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        # report the URL of the page that was just visited
        yield {"url": response.url}
        # follow the "next" pagination link, if there is one
        next_link = response.xpath('//li[@class="next"]/a/@href').get()
        if next_link:
            yield scrapy.Request(response.urljoin(next_link))

process = CrawlerProcess()
process.crawl(MySpider)
process.start()
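If the calling script also needs to receive the visited URLs (rather than just seeing them scroll by in the crawl log), one way is to connect a callback to Scrapy's item_scraped signal and collect the yielded items into a list. This is a minimal sketch reusing the MySpider class from above; the collected_urls list and on_item_scraped callback are just illustrative names.

from scrapy import signals
from scrapy.crawler import CrawlerProcess

# assumes the MySpider class shown above is defined in this module
collected_urls = []

def on_item_scraped(item, response, spider):
    # called once for every item the spider yields
    collected_urls.append(item["url"])

process = CrawlerProcess()
crawler = process.create_crawler(MySpider)
crawler.signals.connect(on_item_scraped, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

print(collected_urls)

Alternatively, on recent Scrapy versions you can skip the signal handling and pass a feed export setting, e.g. CrawlerProcess(settings={"FEEDS": {"urls.json": {"format": "json"}}}), to have the yielded items written to a file instead.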