Tags: scrapy, scrapy-pipeline

Triggering a function after a specific Request finishes in Scrapy


I have a complex scraping application in Scrapy that runs in multiple stages (each stage is a function that calls the next stage of scraping and parsing). The spider tries to download multiple targets, and each target consists of a large number of files. After all the files of a target have been downloaded, I need to call a function that processes them. It cannot process them partially; it needs the whole set of files for the target at once. Is there a way to do this?


Solution

  • If you cannot wait until the whole spider is finished, you will have to write some logic in an item pipeline that keeps track of what you have scraped and executes a function at the right moment. Below is some logic to get you started: it keeps track of the number of items scraped per target, and once that count reaches 100 it executes the target_complete method. Note that you will have to fill in the 'target' field on each item yourself.

    from collections import Counter
    
    class TargetCountPipeline(object):
        def __init__(self):
            self.target_counter = Counter()
            self.target_number = 100

        def process_item(self, item, spider):
            target = item['target']
            self.target_counter[target] += 1
            # use == so the callback fires exactly once per target
            if self.target_counter[target] == self.target_number:
                self.target_complete(target)
            return item

        def target_complete(self, target):
            # execute something here when the target is complete,
            # e.g. process the full set of downloaded files
            pass
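The counting idea above can be exercised without Scrapy at all. Below is a minimal standalone sketch of the same logic; the target names, per-target file counts, and the `on_complete` callback are all hypothetical placeholders, and (unlike the fixed 100 above) the expected count is given per target:

```python
from collections import Counter

class TargetTracker:
    """Fires on_complete(target) once all expected items for a target arrive."""

    def __init__(self, expected, on_complete):
        self.expected = expected        # hypothetical: target -> expected item count
        self.on_complete = on_complete  # callback invoked when a target is complete
        self.counter = Counter()

    def record(self, target):
        self.counter[target] += 1
        # fire exactly once, when the expected count is reached
        if self.counter[target] == self.expected[target]:
            self.on_complete(target)

completed = []
tracker = TargetTracker({"site-a": 3, "site-b": 2}, completed.append)
for target in ["site-a", "site-b", "site-a", "site-b", "site-a"]:
    tracker.record(target)

print(completed)  # -> ['site-b', 'site-a']
```

In a real pipeline you would fill `expected` when the spider schedules a target's file requests, so the threshold does not have to be hard-coded.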