I have a complex scraping application in Scrapy that runs in multiple stages (each stage is a function that calls the next stage of scraping and parsing). The spider downloads multiple targets, and each target consists of a large number of files. What I need is to call a processing function after all of a target's files have been downloaded; it cannot process them partially, it needs the whole set of files for the target at once. Is there a way to do this?
If you cannot wait until the whole spider is finished, you will have to write some logic in an item pipeline that keeps track of what you have scraped and executes a function at the right moment. Below is some logic to get you started: it counts the items you have scraped per target, and when the count reaches 100 it executes the target_complete method. Note that you will have to fill in the 'target' field of the item yourself, of course.
from collections import Counter

class TargetCountPipeline(object):
    def __init__(self):
        # Number of items scraped so far, per target.
        self.target_counter = Counter()
        # Expected number of files per target.
        self.target_number = 100

    def process_item(self, item, spider):
        target = item['target']
        self.target_counter[target] += 1
        # Fire exactly once, when the final item for this target arrives.
        if self.target_counter[target] == self.target_number:
            self.target_complete(target)
        return item

    def target_complete(self, target):
        # Process the complete set of files for this target here.
        pass
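To show how the pieces fit together, here is a minimal sketch of the spider side. The names MySpider, FileItem, and the pipeline path "myproject.pipelines.TargetCountPipeline" are placeholders you would adapt to your project; the key point is that every yielded item carries the 'target' field the pipeline counts on.

import scrapy

class FileItem(scrapy.Item):
    target = scrapy.Field()
    file_url = scrapy.Field()

class MySpider(scrapy.Spider):
    name = "files"
    # Enable the pipeline (module path is hypothetical; adjust to your project).
    custom_settings = {
        "ITEM_PIPELINES": {"myproject.pipelines.TargetCountPipeline": 300},
    }

    def parse(self, response):
        # Tag every item with the target it belongs to, so the
        # pipeline can count completed files per target.
        for url in response.css("a::attr(href)").getall():
            yield FileItem(target=response.url, file_url=response.urljoin(url))

If the number of files varies per target, you can replace the hard-coded target_number with a per-target expected count (for example, recorded when the target's file list is first parsed) and compare against that instead.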