I am trying to notice when there is a problem with the page I am scraping. If the response does not have a valid status code, I want to write a custom value to the crawler stats so that I can return a non-zero exit code from my process. This is what I have written so far:
MySpider.py
from scrapy import Spider

from spiders.utils.logging_utils import inform_user


class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.mydomain.es']
    start_urls = ['http://www.mydomain/Download.html']
    custom_settings = {
        "SPIDER_MIDDLEWARES": {
            "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": None
        }
    }

    def parse(self, response):
        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            reason = 'Status response not valid'
            inform_user(self, 'ERROR', message, close_spider=True, reason=reason)
        ...
utils/logging_utils.py
from scrapy import exceptions as ScrapyExceptions


def inform_user(self, level, message, close_spider=False, reason=''):
    level = level.upper() if isinstance(level, str) else ''
    levels = {
        'CRITICAL': 50,
        'ERROR': 40,
        'WARNING': 30,
        'INFO': 20,
        'DEBUG': 10
    }
    self.logger.log(levels.get(level, 0), message)
    if close_spider:
        self.crawler.stats.set_value('custom/failed_job', 'True')
        raise ScrapyExceptions.UsageError(reason=reason)
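For context, the non-zero exit code can then be derived from that stat once the crawl finishes. A minimal sketch (exit_code_from_stats is a hypothetical helper name, and the commented-out wiring assumes the spider is launched via Scrapy's CrawlerProcess):

```python
import sys


def exit_code_from_stats(stats):
    # Return a non-zero exit code when the spider flagged a failed job.
    return 1 if stats.get('custom/failed_job') else 0


# With Scrapy it could be wired up like this (not executed here):
# from scrapy.crawler import CrawlerProcess
# process = CrawlerProcess()
# crawler = process.create_crawler(MySpider)
# process.crawl(crawler)
# process.start()
# sys.exit(exit_code_from_stats(crawler.stats.get_stats()))

print(exit_code_from_stats({'custom/failed_job': 'True'}))  # 1
print(exit_code_from_stats({}))  # 0
```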
This works as expected; however, I don't think removing the HttpErrorMiddleware is good practice. That's why I am trying to write a custom middleware that sets the stats on the crawler:
MySpider.py
from scrapy import Spider
from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware

from spiders.utils.logging_utils import inform_user


class CustomHttpErrorMiddleware(HttpErrorMiddleware):
    def process_spider_exception(self, response, exception, spider):
        super().process_spider_exception(response, exception, spider)
        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            reason = 'Status response not valid'
            inform_user(self, 'ERROR', message, close_spider=True, reason=reason)


class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.mydomain.es']
    start_urls = ['http://www.mydomain/Download.html']
    custom_settings = {
        "SPIDER_MIDDLEWARES": {
            "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": None,
            CustomHttpErrorMiddleware: 50
        }
    }
However, now I am calling the inform_user function from the middleware, so I don't have access to the spider's self object, which holds the self.logger and self.crawler objects used by the function. How can I make that spider self object available in the middleware?
The spider self object is the argument named spider in the process_spider_exception method of the middleware. You can use it like this:

spider.logger.info(...)
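As a minimal sketch of the pattern, with stand-in classes (StubStats, StubCrawler, and StubSpider are illustrative, not Scrapy classes) showing that the spider argument carries the same logger and crawler.stats that self does inside the spider:

```python
import logging


# Stand-ins for Scrapy objects (illustrative only), so the pattern can be
# demonstrated without a running crawl.
class StubStats:
    def __init__(self):
        self.values = {}

    def set_value(self, key, value):
        self.values[key] = value


class StubCrawler:
    def __init__(self):
        self.stats = StubStats()


class StubSpider:
    def __init__(self):
        self.logger = logging.getLogger('MyScrapper')
        self.crawler = StubCrawler()


def process_spider_exception(response_status, spider):
    # `spider` plays the role that the spider's `self` played before:
    if response_status != 200:
        spider.logger.error("ERROR %s on request.", response_status)
        spider.crawler.stats.set_value('custom/failed_job', 'True')


spider = StubSpider()
process_spider_exception(404, spider)
print(spider.crawler.stats.values)  # {'custom/failed_job': 'True'}
```

In your middleware this means passing spider, not the middleware's own self, as the first argument of inform_user.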