Tags: python, logging, scrapy, screen-scraping

Remove INFO logs from scrapy.middleware and scrapy.crawler


Does anyone know if there is a way to set different log levels for Scrapy's modules? I want to log the scraped items and the requests sent in a log file, but the logs coming from the scrapy.middleware, scrapy.crawler and scrapy.utils.log modules are always the same and add no value to the log file.

My biggest constraint is that I have to do everything outside of the spiders (in the pipelines, the settings.py file, etc.). I have more than 200 spiders and cannot possibly add code to each of them.

Scrapy's docs say, in the advanced customization section, that it is possible to modify the level of a specific logger, but it does not seem to work when this is set in the settings.py file. My guess is that the logs from scrapy.middleware and scrapy.crawler are emitted before the spider evaluates the settings.py file.
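For context, what I tried (following the docs' advanced customization section) looks roughly like this in settings.py; the module names are the ones showing up in my log file:

```python
# settings.py -- per-logger tweak along the lines of the docs'
# advanced customization section; this did not silence the early
# scrapy.middleware / scrapy.crawler INFO messages for me.
import logging

logging.getLogger('scrapy.middleware').setLevel(logging.WARNING)
logging.getLogger('scrapy.crawler').setLevel(logging.WARNING)
```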

I have read Scrapy's docs extensively but cannot seem to find the answer. I don't want to recreate my own loggers, since some of Scrapy's logs are useful, like the ones logging sent requests and errors.

I can provide code extracts if necessary. Thank you.


Solution

  • You can create a Scrapy extension that manipulates the various log levels, setting them to higher values for the loggers you do not want to appear. The first three logs, which come from scrapy.utils.log, are emitted before Scrapy gets to loading its extensions, so for those three I am not sure what to do beyond turning logging off entirely and implementing the logs yourself.

    Here is an example of the extension:

    extension.py

    import logging
    from scrapy.exceptions import NotConfigured
    from scrapy import signals

    logger = logging.getLogger(__name__)

    class CustomLogExtension:

        def __init__(self):
            # Raise these loggers to WARNING so their INFO/DEBUG
            # messages no longer reach the log file.
            self.level = logging.WARNING
            self.modules = ['scrapy.utils.log', 'scrapy.middleware',
                            'scrapy.extensions.logstats', 'scrapy.statscollectors',
                            'scrapy.core.engine', 'scrapy.core.scraper',
                            'scrapy.crawler', 'scrapy.extensions',
                            __name__]
            for module in self.modules:
                logging.getLogger(module).setLevel(self.level)

        @classmethod
        def from_crawler(cls, crawler):
            # Only enable the extension when the setting is switched on.
            if not crawler.settings.getbool('CUSTOM_LOG_EXTENSION'):
                raise NotConfigured
            ext = cls()
            crawler.signals.connect(
                ext.spider_opened, signal=signals.spider_opened
            )
            return ext

        def spider_opened(self, spider):
            # DEBUG is below WARNING, so this line is suppressed.
            logger.debug("This log should not appear.")
    

    Then in your settings.py

    settings.py

    CUSTOM_LOG_EXTENSION = True
    EXTENSIONS = {
        'scrapy.extensions.telnet.TelnetConsole': None,
        'my_project_name.extension.CustomLogExtension': 1,
    }
    

    The example above removes nearly all of the logs produced by Scrapy. If you only want to keep the request logs, remove scrapy.core.engine from the self.modules list in the extension constructor.
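    For what it's worth, the mechanism the extension relies on can be sketched with plain logging alone, no Scrapy required (the logger name below is only illustrative): raising a named logger's level filters out every record below that level.

```python
import logging
from io import StringIO

# Capture log output in a buffer so the effect is easy to inspect.
buffer = StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

noisy = logging.getLogger("scrapy.middleware")  # illustrative name only
noisy.addHandler(handler)
noisy.propagate = False

noisy.setLevel(logging.INFO)
noisy.info("Enabled downloader middlewares: [...]")  # recorded

noisy.setLevel(logging.WARNING)  # what the extension does per module
noisy.info("Enabled spider middlewares: [...]")      # filtered out
noisy.warning("something actually worth reading")    # still recorded

print(buffer.getvalue())
```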