Tags: python, asynchronous, scrapy, twisted, dropbox

Pipeline to post item into storage service


I want a pipeline that asynchronously POSTs items to a storage service. I was thinking of using something like FilesPipeline for this. FilesPipeline comes with a lot of overhead because it assumes I want to save files to disk, whereas here I just want to post the files to a storage API. However, it does have a method that yields Requests: get_media_requests().

I currently get a FileException failure, and I don't know how to remove the step that saves to disk. Is there a way to make this work cleanly?

import os

from scrapy import Request
from scrapy.pipelines.files import FilesPipeline


class StoragePipeline(FilesPipeline):

    # API token for the storage service, read from the environment
    access_token = os.environ['access_token']

    def get_media_requests(self, item, info):
        filename = item['filename']

        headers = {
            'Authorization': f'Bearer {self.access_token}',
            'Dropbox-API-Arg': f'{{"path": "/{filename}"}}',
            'Content-Type': 'application/octet-stream',
        }

        # POST the item's payload to the upload endpoint
        request = Request(
            method='POST',
            url='https://content.dropboxapi.com/2/files/upload',
            headers=headers,
            body=item['data'],
        )

        yield request

    def item_completed(self, results, item, info):
        return item

Solution

  • You can schedule Scrapy requests from a pipeline by exposing the crawler through from_crawler and passing your request to the engine directly:

    import scrapy


    class MyPipeline:
        def __init__(self, crawler):
            self.crawler = crawler

        @classmethod
        def from_crawler(cls, crawler):
            # Scrapy calls this to build the pipeline and hands over the crawler
            return cls(crawler)

        def process_item(self, item, spider):
            if item.get('some_extra_field'):  # check if we already did below
                return item
            req = scrapy.Request('some_url', self.check_deploy,
                                 method='POST', meta={'item': item})
            # On Scrapy 2.6 and later, passing spider is deprecated;
            # call self.crawler.engine.crawl(req) instead.
            self.crawler.engine.crawl(req, spider)
            return item

        def check_deploy(self, response):
            # if not 200 we might want to retry
            if response.status != 200:
                return response.meta['item']
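
To connect this back to the upload request in the question, here is a minimal sketch of how the request arguments could be built before handing them to the engine. The helper name build_upload_request_kwargs and the constant UPLOAD_URL are made up for illustration; the URL and header names are taken from the question. The one change is that the Dropbox-API-Arg header is produced with json.dumps instead of an f-string, so a filename containing quotes or backslashes still yields valid JSON. I am assuming the JSON escaping is what the API wants for such names, so check that against the Dropbox documentation.

```python
import json

# Endpoint copied from the question
UPLOAD_URL = 'https://content.dropboxapi.com/2/files/upload'


def build_upload_request_kwargs(filename, data, access_token):
    """Return keyword arguments for a POST request that uploads `data`.

    Hypothetical helper, not part of Scrapy or of any Dropbox SDK.
    """
    # json.dumps escapes quotes, backslashes and non-ASCII characters,
    # which a hand-written f-string would pass through unescaped.
    api_arg = json.dumps({'path': f'/{filename}'})
    return {
        'method': 'POST',
        'url': UPLOAD_URL,
        'headers': {
            'Authorization': f'Bearer {access_token}',
            'Dropbox-API-Arg': api_arg,
            'Content-Type': 'application/octet-stream',
        },
        'body': data,
    }


kwargs = build_upload_request_kwargs('my "report".txt', b'payload', 'TOKEN')
print(kwargs['headers']['Dropbox-API-Arg'])
```

Inside process_item, the resulting dict could then be unpacked into the request, for example scrapy.Request(callback=self.check_deploy, meta={'item': item}, **kwargs), and scheduled the same way as above.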