
Scrapy - Dynamic file naming from parsed item


I'm working on a scraping program for an art museum.
I'm new to the Scrapy framework and intermediate in Python at best.
I need to download images from the website and name them according to a value from the parsed data.
I've been going through the Scrapy documentation and Google searches, but no luck so far. I'm stuck at the pipeline.
I know I could fix the file names after running Scrapy with a wrapper program, but that seems counterproductive and sloppy.

Each item yielded from the spider looks like this:

{'Artist': 'SomeArtist',
 ...
 'Image Url': 'https://www.nationalgallery.org.uk/media/33219/n-1171-00-000049-hd.jpg',
 'Inventory number': 'NG1171'}

I need to name each image after its 'Inventory number'.

I managed to write a custom pipeline, but I can't make it work the way I want.
The closest I got was this, but it failed miserably by assigning the same self.file_name value to many images:

class DownloadPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # The only point I've found for accessing the item dict before downloading
        self.file_name = item['Inventory number']
        yield Request(item["Image Url"])

    def file_path(self, request, response=None, info=None):
        return f"Images/{self.file_name}.jpg"
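
The repeated names most likely come from the pipeline being a single object shared by every item: Scrapy can create the requests for several items before any download finishes, so each new item overwrites self.file_name. A minimal sketch of that effect, using a plain stand-in class instead of Scrapy (the class name, the second inventory number, and the example.com URLs are made up for illustration):

```python
class SharedStatePipeline:
    """Stand-in for the pipeline above; no Scrapy required."""

    def get_media_requests(self, item):
        # Every item overwrites the same attribute on the shared instance.
        self.file_name = item["Inventory number"]
        return item["Image Url"]

    def file_path(self, url):
        return f"Images/{self.file_name}.jpg"


items = [
    {"Inventory number": "NG1171", "Image Url": "https://example.com/a.jpg"},
    {"Inventory number": "NG1172", "Image Url": "https://example.com/b.jpg"},
]

pipeline = SharedStatePipeline()
# Requests for both items are created before any path is computed...
urls = [pipeline.get_media_requests(item) for item in items]
# ...so only the last inventory number is still stored on the instance.
paths = [pipeline.file_path(url) for url in urls]
print(paths)  # ['Images/NG1172.jpg', 'Images/NG1172.jpg']
```

Any value that has to survive until file_path runs therefore needs to travel with the individual request rather than live on the pipeline.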

Something like this would be great:

class DownloadPipeline(ImagesPipeline):

    def file_path(self, request, item, response=None, info=None):
        file_name = item['Inventory number']
        return f"Images/{file_name}.jpg"

Is there any way to make that work?


Solution

  • When you yield the request in get_media_requests, you can pass arbitrary data in the meta parameter and then read it back from request.meta in your file_path method.

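    from scrapy import Request
    from scrapy.pipelines.images import ImagesPipeline


    class DownloadPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # Attach the inventory number to the request so file_path can read it.
            yield Request(
                url=item["Image Url"],
                meta={'inventory_number': item.get('Inventory number')},
            )

        # item=None keeps the signature compatible with Scrapy 2.4+, which
        # passes the item as a keyword argument.
        def file_path(self, request, response=None, info=None, *, item=None):
            # Each request carries its own meta, so names no longer collide.
            file_name = request.meta.get('inventory_number')
            return f"Images/{file_name}.jpg"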
    

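
One caveat with the pipeline above: item.get('Inventory number') returns None when the field is missing, which would produce "Images/None.jpg", and an inventory number containing a slash or space would produce an awkward path. A possible guard is sketched below with only the standard library. The helper name build_image_path, its folder parameter, and the choice of a SHA-1 hash of the URL as the fallback name are assumptions for illustration, not part of Scrapy or of the original answer.

```python
import hashlib
import re


def build_image_path(inventory_number, url, folder="Images"):
    """Return a relative path such as 'Images/NG1171.jpg'.

    Falls back to a SHA-1 hash of the URL when no usable number is given.
    """
    name = (inventory_number or "").strip()
    # Replace runs of characters that are unsafe in file names with "_".
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("._")
    if not name:
        # Hypothetical fallback: a stable name derived from the image URL.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{folder}/{name}.jpg"


url = "https://www.nationalgallery.org.uk/media/33219/n-1171-00-000049-hd.jpg"
print(build_image_path("NG1171", url))    # Images/NG1171.jpg
print(build_image_path("NG 11/71", url))  # Images/NG_11_71.jpg
```

Inside file_path this could be called as build_image_path(request.meta.get('inventory_number'), request.url). On Scrapy 2.4 and later, file_path also receives the item as a keyword-only argument, so the number could be read from item directly instead of going through meta.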