Tags: scrapy, web-crawler

Scrapy using start_requests with rules


I can't find any solution for using start_requests with rules, and I haven't seen any example on the Internet that combines the two. My goal is simple: I want to redefine the start_requests method so I can catch all exceptions during the requests and also use meta in the requests. This is the code of my spider:

class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['www.oreilly.com']
    start_urls = ['https://www.oreilly.com/library/view/practical-postgresql/9781449309770/ch04s05.html']

# Based on the Scrapy docs
def start_requests(self):
    for u in self.start_urls:
        yield Request(u, callback=self.parse_item, errback=self.errback_httpbin, dont_filter=True)

rules = (
    Rule(LinkExtractor(), callback='parse_item', follow=True),
)

def parse_item(self, response):
    item = {}
    item['title'] = response.xpath('//head/title/text()').extract()
    item['url'] = response.url
    yield item

def errback_httpbin(self, failure):
    self.logger.error('ERRRRROR - {}'.format(failure))

This code scrapes only one page. I tried to modify it, and instead of:

def parse_item(self, response):
    item = {}
    item['title'] = response.xpath('//head/title/text()').extract()
    item['url'] = response.url
    yield item

I tried this instead, based on this answer:

def parse_item(self, response):
    item = {}
    item['title'] = response.xpath('//head/title/text()').extract()
    item['url'] = response.url
    return self.parse(response) 

It seems to work, but it doesn't scrape anything, even if I add a parse method to my spider. Does anybody know how to use start_requests and rules together? I would be glad for any information on this topic. Happy coding!


Solution

  • Here is a solution for handling the errback with a LinkExtractor Rule (see the sketch below).

    Thanks to this dude!
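
For reference, here is a minimal sketch of one way to combine start_requests with rules. It assumes Scrapy 2.0 or later, where Rule accepts an errback argument. The key point is that the rules are only applied by CrawlSpider's built-in parse method, so the start requests must not be routed to a custom callback; the meta key below is just a placeholder to show how meta can be attached.

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class TestSpider(CrawlSpider):
        name = 'test'
        allowed_domains = ['www.oreilly.com']
        start_urls = ['https://www.oreilly.com/library/view/practical-postgresql/9781449309770/ch04s05.html']

        rules = (
            # errback on a Rule is available since Scrapy 2.0
            Rule(LinkExtractor(), callback='parse_item', follow=True,
                 errback='errback_httpbin'),
        )

        def start_requests(self):
            for u in self.start_urls:
                # No custom callback here: the default CrawlSpider parse
                # still runs, so the rules are applied to the start responses.
                # 'my_meta_key' is just an illustrative placeholder.
                yield scrapy.Request(u, errback=self.errback_httpbin,
                                     dont_filter=True,
                                     meta={'my_meta_key': 'start'})

        def parse_item(self, response):
            item = {}
            item['title'] = response.xpath('//head/title/text()').extract()
            item['url'] = response.url
            yield item

        def errback_httpbin(self, failure):
            self.logger.error('ERROR - {}'.format(failure))

With this setup the errback on the start requests covers failures of the start URLs, while the errback on the Rule covers failures of the links it extracts; parse_item is still reached only through the rule.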