Tags: python, scrapy, middleware, scrapy-middleware

Access Spider self object on custom middleware


I am trying to detect when there is a problem with the page I am scraping. In case the response does not have a valid status code, I want to write a custom value to the crawler stats so that I can return a non-zero exit code from my process. This is what I have written so far:

MySpider.py

from scrapy import Spider

from spiders.utils.logging_utils import inform_user

class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.mydomain.es']
    start_urls = ['http://www.mydomain/Download.html']
    custom_settings = {
        "SPIDER_MIDDLEWARES": {
            "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": None
        }
    }

    def parse(self, response):
        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            reason = 'Status response not valid'
            inform_user(self, 'ERROR', message, close_spider=True, reason=reason)
        ...

utils/logging_utils.py

from scrapy import exceptions as ScrapyExceptions

def inform_user(self, level, message, close_spider=False, reason=''):
    level = level.upper() if isinstance(level, str) else ''
    levels = {
        'CRITICAL': 50,
        'ERROR': 40,
        'WARNING': 30,
        'INFO': 20,
        'DEBUG': 10
    }
    self.logger.log(levels.get(level, 0), message)
    if close_spider:
        # Flag the failure in the crawler stats so the calling process
        # can detect it and return a non-zero exit code.
        self.crawler.stats.set_value('custom/failed_job', 'True')
        # Pass the reason positionally; UsageError does not accept
        # a 'reason' keyword argument.
        raise ScrapyExceptions.UsageError(reason)
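
For the non-zero exit code itself, here is a minimal sketch of how the custom/failed_job stat could be read back after the crawl, assuming the spider is started from a script via CrawlerProcess (the MySpider import path is hypothetical):

import sys

from scrapy.crawler import CrawlerProcess

from spiders.MySpider import MySpider  # hypothetical import path

process = CrawlerProcess()
crawler = process.create_crawler(MySpider)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

# Exit non-zero if the spider flagged a failure in its stats.
if crawler.stats.get_value('custom/failed_job'):
    sys.exit(1)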

The approach above works as expected; however, I don't think removing the HttpErrorMiddleware is good practice. That's why I am trying to write a custom middleware that sets the stats on the crawler instead:

MySpider.py

from scrapy import Spider
from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware

from spiders.utils.logging_utils import inform_user

class CustomHttpErrorMiddleware(HttpErrorMiddleware):
    def process_spider_exception(self, response, exception, spider):
        super().process_spider_exception(response, exception, spider)

        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            reason = 'Status response not valid'
            inform_user(self, 'ERROR', message, close_spider=True, reason=reason)

class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.mydomain.es']
    start_urls = ['http://www.mydomain/Download.html']
    custom_settings = {
        "SPIDER_MIDDLEWARES": {
            "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": None,
            CustomHttpErrorMiddleware: 50
        }
    }

However, now I am calling the inform_user function from inside the middleware, where self is the middleware instance, so I don't have access to the spider's self object, which provides the self.logger and self.crawler attributes the function uses. How can I make that spider object available in the middleware?


Solution

  • The spider's self object is passed into the middleware as the argument named spider in its process_spider_exception method. Use it in place of self, e.g. spider.logger.info(...)
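
Applied to the question's code, a minimal sketch of the middleware with the spider argument forwarded to inform_user (module names as in the question):

from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware

from spiders.utils.logging_utils import inform_user

class CustomHttpErrorMiddleware(HttpErrorMiddleware):
    def process_spider_exception(self, response, exception, spider):
        super().process_spider_exception(response, exception, spider)

        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            # Pass the spider, not the middleware instance: inform_user
            # uses spider.logger and spider.crawler internally.
            inform_user(spider, 'ERROR', message, close_spider=True,
                        reason='Status response not valid')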