Tags: python, scrapy, scrapinghub

Scrapy: access to log counts while running on Scrapinghub


I have a small Scrapy extension which looks into a crawler's stats object and sends me an email if the crawler has logged messages of a certain level (e.g. WARNING, ERROR, CRITICAL).

These stats are accessible through the spider's stats object (crawler.stats.get_stats()), e.g.:

crawler.stats.get_stats().items()
 [..]
 'log_count/DEBUG': 9,
 'log_count/ERROR': 2,
 'log_count/INFO': 4,
 [..]
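
A minimal sketch of what such a mail extension can look like (the class name, recipient address and checked levels here are illustrative, not my actual code):

    from scrapy import signals
    from scrapy.mail import MailSender


    class ErrorMailer(object):
        """Hypothetical sketch: mail the stats when a spider finishes
        with ERROR/CRITICAL log messages recorded."""

        def __init__(self, crawler):
            self.crawler = crawler

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls(crawler)
            # Check the stats once the spider closes and they are final.
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_closed(self, spider):
            stats = self.crawler.stats.get_stats()
            errors = (stats.get('log_count/ERROR', 0) +
                      stats.get('log_count/CRITICAL', 0))
            if errors:
                mailer = MailSender.from_settings(self.crawler.settings)
                mailer.send(to=['me@example.com'],
                            subject='%s: %d error(s) logged' % (spider.name, errors),
                            body='\n'.join('%s: %s' % kv for kv in stats.items()))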

If I run the spider on Scrapinghub, the log stats are not there. There are a lot of other things (e.g. exception count, etc.), but the log counts are missing. Does anyone know how to get them there, or how to access them on Scrapinghub?

I've also checked the "Dumping Scrapy stats" values after a spider closes. If I run it on my machine, the log count is there; if I run it on Scrapinghub, it is missing.


Solution

  • This might also help someone else: I wrote a small extension that collects the log stats and saves them in the stats dict under its own prefix.

    To activate it, save it to a file (e.g. loggerstats.py) and register it as an extension in your crawler's settings.py:

    EXTENSIONS = {
        'loggerstats.LoggerStats': 10,
    }
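
    (The dotted path 'loggerstats.LoggerStats' assumes loggerstats.py is importable, e.g. placed in the project root next to settings.py; adjust the path if you keep the file elsewhere.)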
    

    The script:

    # Note: written against the old scrapy.log API (Scrapy 0.x);
    # newer Scrapy versions use the stdlib logging module instead.
    from scrapy import log
    from scrapy.log import level_names
    from twisted.python import log as txlog


    class LoggerStats(object):

        def __init__(self, crawler, prefix='stats_', level=log.INFO):
            self.level = level
            self.crawler = crawler
            self.prefix = prefix
            # Register as a Twisted log observer; Scrapy routes its
            # logging through Twisted, so emit() sees every log event.
            txlog.startLoggingWithObserver(self.emit, setStdout=False)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def emit(self, ev):
            level = ev.get('logLevel')
            # Skip events without a level (e.g. plain Twisted messages),
            # then count everything at or above the configured threshold.
            if level is not None and level >= self.level:
                sname = '%slog_count/%s' % (self.prefix, level_names.get(level, level))
                self.crawler.stats.inc_value(sname)
    

    It then counts the log messages and maintains the counts in the crawler stats. For example:

    stats_log_count/INFO: 10
    stats_log_count/WARNING: 1
    stats_log_count/CRITICAL: 5
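
    These prefixed keys can be read the same way as the built-in ones, so the mail extension from the question only has to look up the new names (a sketch; 'stats_' is the default prefix from the code above):

    stats = crawler.stats.get_stats()
    errors = (stats.get('stats_log_count/ERROR', 0) +
              stats.get('stats_log_count/CRITICAL', 0))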