Tags: python, exceptions, scrapy, exit-code

Return non-zero exit code when raising a scrapy.exceptions.UsageError exception


I have a Scrapy script which looks like this:

main.py

import os
import argparse
import datetime
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from spiders.mySpider import MySpider

parser = argparse.ArgumentParser(description='My Scrapper')
parser.add_argument('-v',
                    '--verbose', 
                    help='Verbose mode',
                    action='store_true')
parser.add_argument('-t', 
                    '--type', 
                    help='Type',
                    type=str)

args = parser.parse_args()

if args.type != 'expected':
    parser.error("Wrong type")

if __name__ == "__main__":
    settings = get_project_settings()
    settings['LOG_ENABLED'] = args.verbose
    process = CrawlerProcess(settings=settings)
    process.crawl(MySpider, type_arg=args.type)
    process.start()

mySpider.py

from scrapy import Spider
from scrapy.http import Request, FormRequest
import scrapy.exceptions as ScrapyExceptions

class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.webtoscrape.com']
    start_urls = ['http://www.webtoscrape.com/path/to/page.html']

    def parse(self, response):
        # ...
        # Some logic
        # ...

        if condition:
            raise ScrapyExceptions.UsageError(reason="Wrong argument")

When I call parser.error() in main.py, the process returns a non-zero exit code as expected. However, when I raise scrapy.exceptions.UsageError() in mySpider.py, I get exit code 0, so the Jenkins pipeline step that runs my script thinks it has succeeded and continues with the pipeline execution. I run the script with python3 main.py --type my_type.

Why doesn't the script exit with a non-zero code when the usage error is raised in the mySpider.py module?
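
For reference, the behaviour is reproducible from outside the script: Scrapy catches exceptions raised inside spider callbacks, logs them as spider errors, and lets the crawl finish, so the interpreter still exits with code 0. A minimal check (a sketch; the --type value below is a placeholder for whatever reaches the UsageError branch) looks like this:

import subprocess

# Run the scraper as a child process and inspect its exit code
result = subprocess.run(['python3', 'main.py', '--type', 'expected'])
print(result.returncode)  # still 0, even though UsageError was raised inside parse()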


Solution

  • After several hours of trying different approaches, I found this thread. The problem is that Scrapy does not exit with a non-zero code when a scrape fails. I managed to fix this behaviour by using the crawler's stats collection.

    main.py

    import sys

    if __name__ == "__main__":
        settings = get_project_settings()
        settings['LOG_ENABLED'] = args.verbose
        process = CrawlerProcess(settings=settings)
        process.crawl(MySpider, type_arg=args.type)
        # Keep a reference to the crawler so its stats can be read after the crawl finishes
        crawler = list(process.crawlers)[0]
        process.start()

        # Translate the custom stat set by the spider into a non-zero exit code
        failed = crawler.stats.get_value('custom/failed_job')
        if failed:
            sys.exit(1)
    

    mySpider.py

    class MySpider(Spider):
        name = 'MyScrapper'
        allowed_domains = ['www.webtoscrape.com']
        start_urls = ['http://www.webtoscrape.com/path/to/page.html']
    
        def parse(self, response):
            # ...
            # Some logic
            # ...
    
            if condition:
                # Flag the failure in the crawler stats so main.py can turn it into a non-zero exit code
                self.crawler.stats.set_value('custom/failed_job', 'True')
                raise ScrapyExceptions.UsageError(reason="Wrong argument")
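
    With this change the Jenkins step fails as soon as the spider flags a problem. A quick way to confirm it (a sketch, reusing the python3 main.py invocation from the question; the --type value is again a placeholder) is to check the exit code of a run:

    import subprocess

    # After the fix, a flagged crawl propagates sys.exit(1) to the caller
    result = subprocess.run(['python3', 'main.py', '--type', 'expected'])
    print(result.returncode)  # 1 when the spider set custom/failed_job, 0 otherwise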