Tags: python, python-3.x, scrapy, scrapy-pipeline

Scrapy skip request based on previous crawl from same spider


In the example below, each bucket contains lots of balls. A bucket may or may not contain a red ball. To find out whether a ball is red, we crawl it.

If a red ball is found, I'd like to stop crawling the rest of the balls in that bucket (i.e. I don't want a request sent out for the next ball, which I know won't be red, because I've already found it).

Bucket and ball identifiers are query params for the base URL.
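
(For reference, the add_or_replace_parameter and url_query_parameter helpers used in the snippets below are presumably from w3lib.url. A quick sketch of how the query params end up in the URL, using the placeholder domain from this example:)

from w3lib.url import add_or_replace_parameter, url_query_parameter

base_url = 'https://bucketswithballs.com'
url = add_or_replace_parameter(base_url, 'bucket', '7')  # add ?bucket=7
url = add_or_replace_parameter(url, 'ball', '42')        # add &ball=42
# url is now something like 'https://bucketswithballs.com?bucket=7&ball=42'
print(url_query_parameter(url, 'bucket'))  # -> '7'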

What I've tried #1

Maintain class state and check whether a bucket already has a red ball:

import scrapy
from w3lib.url import add_or_replace_parameter, url_query_parameter


class BucketsBallsSpider(scrapy.Spider):
    name = 'test_spider'
    base_url = 'https://bucketswithballs.com'
    buckets = []
    balls = []
    buckets_with_red_balls = []
    
    def start_requests(self):
        for bucket in self.buckets:
            for ball in self.balls:
                if bucket in self.buckets_with_red_balls:
                    break
                url = add_or_replace_parameter(self.base_url, 'bucket', bucket)
                url = add_or_replace_parameter(url, 'ball', ball)
                yield scrapy.Request(url, self.parse)
                
    def parse(self, response, **kwargs):
        is_red_ball = response.xpath('//*[@id="is_red_ball"]').extract()
        if is_red_ball:
            bucket_id = url_query_parameter(response.url, 'bucket')
            self.buckets_with_red_balls.append(bucket_id)
            yield {'bucket_with_red_ball': bucket_id}

What I've tried #2

Yield the next requests from the parse method instead:

class BucketsBallsSpider(scrapy.Spider):
    name = 'test_spider'
    base_url = 'https://bucketswithballs.com'
    buckets = []
    balls = []
    buckets_with_red_balls = []

    def start_requests(self):
        # Start from first bucket and first ball
        url = add_or_replace_parameter(self.base_url, 'bucket', self.buckets[0])
        url = add_or_replace_parameter(url, 'ball', self.balls[0])
        yield scrapy.Request(url, self.parse)

    def parse(self, response, **kwargs):
        is_red_ball = response.xpath('//*[@id="is_red_ball"]').extract()
        if is_red_ball:
            bucket_id = url_query_parameter(response.url, 'bucket')
            self.buckets_with_red_balls.append(bucket_id)
            yield {'bucket_with_red_ball': bucket_id}

        # Scrapy's duplicate filter will skip URLs that were already requested
        for bucket in self.buckets:
            for ball in self.balls:
                if bucket in self.buckets_with_red_balls:
                    break
                url = add_or_replace_parameter(self.base_url, 'bucket', bucket)
                url = add_or_replace_parameter(url, 'ball', ball)
                yield scrapy.Request(url, self.parse)

In both examples, Scrapy reports in the console that it crawled every single URL. For performance reasons, I'd like to avoid that.


Solution

  • It won't work because Scrapy works asynchronously, and I don't think you can stop the other requests, as they may already be in flight. You could raise a CloseSpider() exception to terminate the spider when a red ball is found, but concurrent requests would still finish before the spider closes. See the Scrapy architecture overview in the documentation.
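
    As a rough sketch (untested), raising CloseSpider from the parse callback of your first attempt would look something like this; note that it only asks Scrapy to shut down, so responses already in flight are still processed:

    from scrapy.exceptions import CloseSpider
    from w3lib.url import url_query_parameter

    def parse(self, response, **kwargs):
        if response.xpath('//*[@id="is_red_ball"]').extract():
            bucket_id = url_query_parameter(response.url, 'bucket')
            yield {'bucket_with_red_ball': bucket_id}
            # Asks Scrapy to close the spider; requests already scheduled
            # or in flight may still complete before it actually stops.
            raise CloseSpider('red_ball_found')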

    If you need it to stop and not make any further requests after a red ball is found, I think you want it to be synchronous. That would probably be easier to do with plain Python requests, for example.
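
    For illustration only, a synchronous version with requests and parsel (untested; same placeholder URL and is_red_ball marker as above) could look roughly like this:

    import requests
    from parsel import Selector
    from w3lib.url import add_or_replace_parameter

    def find_buckets_with_red_balls(base_url, buckets, balls):
        buckets_with_red_balls = []
        for bucket in buckets:
            for ball in balls:
                url = add_or_replace_parameter(base_url, 'bucket', bucket)
                url = add_or_replace_parameter(url, 'ball', ball)
                html = requests.get(url).text
                if Selector(text=html).xpath('//*[@id="is_red_ball"]').get():
                    buckets_with_red_balls.append(bucket)
                    break  # red ball found, skip the remaining balls in this bucket
        return buckets_with_red_balls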

    Having said that, I updated your example to work synchronously (I haven't tested it). This will force Scrapy to make requests one by one, regardless of how many concurrent requests it is configured for, which is not very efficient.

    import scrapy
    from w3lib.url import add_or_replace_parameter, url_query_parameter


    class BucketsBallsSpider(scrapy.Spider):
        name = 'test_spider'
        base_url = 'https://bucketswithballs.com'
        buckets = []
        balls = []

        current_bucket_idx = 0
        current_ball_idx = 0

        def start_requests(self):
            # Start from the first bucket and the first ball
            url = add_or_replace_parameter(self.base_url, 'bucket', self.buckets[0])
            url = add_or_replace_parameter(url, 'ball', self.balls[0])
            yield scrapy.Request(url, self.parse)

        def parse(self, response, **kwargs):
            is_red_ball = response.xpath('//*[@id="is_red_ball"]').extract()
            if is_red_ball:
                bucket_id = url_query_parameter(response.url, 'bucket')
                yield {'bucket_with_red_ball': bucket_id}
                # Stop here: no further requests are made once a red ball is found
                return

            next_bucket, next_ball = self._get_next_bucket_and_ball()
            if not next_bucket:
                return

            # Chain the next request only after this response has been parsed,
            # so requests are effectively made one at a time
            url = add_or_replace_parameter(self.base_url, 'bucket', next_bucket)
            url = add_or_replace_parameter(url, 'ball', next_ball)
            yield scrapy.Request(url, self.parse)

        def _get_next_bucket_and_ball(self):
            if self.current_ball_idx < len(self.balls) - 1:
                self.current_ball_idx += 1
            else:
                self.current_ball_idx = 0
                if self.current_bucket_idx < len(self.buckets) - 1:
                    self.current_bucket_idx += 1
                else:
                    # No more buckets/balls to try
                    return None, None

            next_bucket = self.buckets[self.current_bucket_idx]
            next_ball = self.balls[self.current_ball_idx]
            return next_bucket, next_ball
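
    A possible way to run it (assuming the empty buckets and balls lists are filled in first; the values here are just for illustration):

    from scrapy.crawler import CrawlerProcess

    BucketsBallsSpider.buckets = ['1', '2', '3']
    BucketsBallsSpider.balls = ['a', 'b', 'c']

    process = CrawlerProcess()
    process.crawl(BucketsBallsSpider)
    process.start()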