Search code examples
python-3.xweb-scrapingscrapyscrapy-pipeline

Why scrapy image pipeline is not downloading images?


I am trying to download all the images from the product gallery. I have tried the mentioned script but somehow I am not able to download the images. I could manage to download the main image which contains an id. The other images from the gallery do not contain any id and I failed to download them.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BasicSpider(CrawlSpider):
    name = 'basic'
    allowed_domains = ['www.leebmann24.de']
    start_urls = ['https://www.leebmann24.de/bmw.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='category-products']/ul/li/h2/a"), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths="//li[@class='next']/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):

        yield {
            'URL': response.url,
            'Price': response.xpath("normalize-space(//span[@class='price']/text())").get(),
            'image_urls': response.xpath("//div[@class='item']/a/img/@src").getall()
        } 

Solution

  • @Raisul Islam, '//*[@id="image-main"]/@src' is generating the image url and I'm not getting any issues. Please, see the output whether that's your expacted or not.

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class BasicSpider(CrawlSpider):
        name = 'basic'
        allowed_domains = ['www.leebmann24.de']
        start_urls = ['https://www.leebmann24.de/bmw.html']
    
        rules = (
            Rule(LinkExtractor(restrict_xpaths="//div[@class='category-products']/ul/li/h2/a"), callback='parse_item'),
            Rule(LinkExtractor(restrict_xpaths="//li[@class='next']/a"), callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
    
            yield {
                'URL': response.url,
                'Price': response.xpath("normalize-space(//span[@class='price']/text())").get(),
                'image_urls': response.xpath('//*[@id="image-main"]/@src').get()
            } 
    

    Output:

    {'URL': 'https://www.leebmann24.de/aruma-antirutschmatte-3er-f30-f31.html', 'Price': '57,29\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/a/r/aruma-antirutschmatte-94452302924-1.jpg'}
    2022-09-07 02:35:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html> (referer: https://www.leebmann24.de/bmw.html?p=2)
    2022-09-07 02:35:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html>
    {'URL': 'https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html', 'Price': '15,64\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/b/m/bmw-erste-hilfe-klarsichtbeutel-51477158433.jpg'}
    2022-09-07 02:35:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.leebmann24.de/erste-hilfe-set.html> (failed 1 times): 503 Service Unavailable
    2022-09-07 02:35:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html> (referer: https://www.leebmann24.de/bmw.html)
    2022-09-07 02:35:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html>
    {'URL': 'https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html', 'Price': '71,66\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/a/r/aruma-antirutschmatte-94452347734-1.jpg'}