I am using Scrapy and Scrapyd to monitor certain sites. The output files are compressed jsonlines. Right after I schedule a job with scrapyd, I can see the output file being created and growing as the spider scrapes.
My problem is that I can't tell when the output file is ready, i.e. when the spider has finished. One way to do it is to rename the output file to something like "output.done" so my other programs can list these files and process them.
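For example, the downstream side could then be as simple as something like this (a rough sketch; the feed directory and the processing step are just placeholders):

import glob
import os

def process_finished_feeds(feed_dir="/data/feeds"):
    # Only pick up files that the scraper has explicitly marked as finished.
    for done_path in glob.glob(os.path.join(feed_dir, "*.done")):
        print("ready for processing:", done_path)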
My current method is to check the modification time of the file, and if it hasn't changed for five minutes I assume the spider is done. However, five minutes sometimes isn't enough, and I really hope I don't need to extend it to 30 minutes.
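That check boils down to something like this (a sketch of the heuristic; the threshold is arbitrary):

import os
import time

QUIET_PERIOD = 5 * 60  # seconds without modification before the file is assumed done

def looks_finished(path):
    # Heuristic: if the file has not been touched for QUIET_PERIOD seconds,
    # assume the spider that was writing it has finished.
    return time.time() - os.path.getmtime(path) > QUIET_PERIOD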
I got a working solution after trying different approaches.
Since in my particular case I dump the output into files, specifically bz2 files, I customized FileFeedStorage to do the job before opening and after closing the file. See the code below:
from scrapy.contrib.feedexport import FileFeedStorage
import bz2
import os

MB = 1024 * 1024

class Bz2FileFeedStorage(FileFeedStorage):

    IN_PROGRESS_MARKER = ".inprogress"

    def __init__(self, uri):
        super(Bz2FileFeedStorage, self).__init__(uri)
        # Write to "<path>.inprogress" while the spider is still running.
        self.in_progress_file = self.path + Bz2FileFeedStorage.IN_PROGRESS_MARKER

    def open(self, spider):
        dirname = os.path.dirname(self.path)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        # Compress on the fly; the last argument is the buffer size.
        return bz2.BZ2File(self.in_progress_file, "w", 10 * MB)

    def store(self, file):
        # The base class closes the file, so the rename only happens once
        # the feed is complete and flushed to disk.
        super(Bz2FileFeedStorage, self).store(file)
        os.rename(self.in_progress_file, self.path)
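To make Scrapy use this class for file:// feed URIs, it has to be registered via the FEED_STORAGES setting. A minimal sketch, assuming the class lives in a module called myproject.feedstorage (adjust the path to wherever you put it):

# settings.py
FEED_STORAGES = {
    "file": "myproject.feedstorage.Bz2FileFeedStorage",  # hypothetical module path
}

# Example feed URI; the resulting .inprogress file is renamed to this path when done.
FEED_URI = "file:///data/feeds/%(name)s/%(time)s.jl.bz2"

With this in place, other programs only ever see the final filename once the spider has finished, and anything still ending in ".inprogress" can safely be ignored.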