Tags: python, python-3.x, web-scraping, scrapy, scrapy-pipeline

Can't use csv pipelines and images pipelines within a spider correctly


I'm trying to write the first two fields to a CSV file and, at the same time, use the last two fields to download images into a folder. I've created two custom pipelines to achieve that.

This is the spider:

import scrapy

class PagalWorldSpider(scrapy.Spider):
    name = 'pagalworld'
    start_urls = ['https://www.pagalworld.pw/indian-pop-mp3-songs-2021/files.html']

    custom_settings = {
        'ITEM_PIPELINES': {
            'my_project.pipelines.PagalWorldImagePipeline': 1,
            'my_project.pipelines.CSVExportPipeline': 300
        },
        'IMAGES_STORE': r"C:\Users\WCS\Desktop\Images",
    }

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url,callback=self.parse)

    def parse(self, response):
        for item in response.css(".files-list .listbox a[href]::attr(href)").getall():
            inner_page_link = response.urljoin(item)
            yield scrapy.Request(inner_page_link,callback=self.parse_download_links)

    def parse_download_links(self,response):
        title = response.css("h1.title::text").get()
        categories = ', '.join(response.css("ul.breadcrumb > li > a::text").getall())

        file_link = response.css(".file-details audio > source::attr(src)").get()
        image_link = response.urljoin(response.css(".alb-img-det > img[data-src]::attr('data-src')").get())
        image_name = file_link.split("-")[-1].strip().replace(" ","_").replace(".mp3","")
        
        yield {"Title":title,"categories":categories,"image_urls":[image_link],"image_name":image_name}

If I execute the script as is, all four fields that I yield in the parse_download_links method end up in the CSV file. The script also downloads and renames the images correctly.

I only want the first two fields, Title and categories, written to the CSV file, not image_urls and image_name. Those two fields are only there to download and rename the images.

How can I use both of the pipelines correctly?


Solution

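  • You don't have to create a CSV pipeline just for this purpose. Scrapy's built-in feed exports can write the CSV file, and the FEED_EXPORT_FIELDS setting restricts which fields are exported.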

    import scrapy
    
    
    class PagalWorldSpider(scrapy.Spider):
        name = 'pagalworld'
        start_urls = ['https://www.pagalworld.pw/indian-pop-mp3-songs-2021/files.html']
    
        custom_settings = {
            'ITEM_PIPELINES': {
                'my_project.pipelines.PagalWorldImagePipeline': 1,
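                # The custom CSV pipeline is no longer needed; the feed export below replaces it.
                # 'my_project.pipelines.CSVExportPipeline': 300
            },
            'IMAGES_STORE': r'C:\Users\WCS\Desktop\Images',
            # Built-in feed export: write the items to a CSV file, replacing any previous output.
            'FEEDS': {
                r'file:///C:\Users\WCS\Desktop\output.csv': {'format': 'csv', 'overwrite': True}
            },
            # Export only these fields; image_urls and image_name stay in the item for the image pipeline.
            'FEED_EXPORT_FIELDS': ['Title', 'categories']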
        }
    
        def start_requests(self):
            for start_url in self.start_urls:
                yield scrapy.Request(start_url, callback=self.parse)
    
        def parse(self, response):
            for item in response.css(".files-list .listbox a[href]::attr(href)").getall():
                inner_page_link = response.urljoin(item)
                yield scrapy.Request(inner_page_link, callback=self.parse_download_links)
    
        def parse_download_links(self,response):
            title = response.css("h1.title::text").get()
            categories = ', '.join(response.css("ul.breadcrumb > li > a::text").getall())
    
            file_link = response.css(".file-details audio > source::attr(src)").get()
            image_link = response.urljoin(response.css(".alb-img-det > img[data-src]::attr('data-src')").get())
            image_name = file_link.split("-")[-1].strip().replace(" ", "_").replace(".mp3", "")
    
            yield {"Title": title, "categories": categories, "image_urls": [image_link], "image_name": image_name}
    

    Output:

    Heartfail - Mika Singh mp3 song Download PagalWorld.com,"Home, MUSIC, INDIPOP, Indian Pop Mp3 Songs 2021"
    Fakir - Hansraj Raghuwanshi mp3 song Download PagalWorld.com,"Home, MUSIC, INDIPOP, Indian Pop Mp3 Songs 2021"
    Humsafar - Suyyash Rai mp3 song Download PagalWorld.com,"Home, MUSIC, INDIPOP, Indian Pop Mp3 Songs 2021"
    ...
    ...
    ...
    

    EDIT:

    To run the spider from a script, use a main.py like this:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    
    if __name__ == "__main__":
        spider = 'pagalworld'
        settings = get_project_settings()
        settings['USER_AGENT'] = (
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
            '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
        )
        process = CrawlerProcess(settings)
        process.crawl(spider)
        process.start()
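
If you would rather keep a custom CSV pipeline than switch to feed exports, the same filtering can be done inside the pipeline. The following is only a minimal sketch, not the CSVExportPipeline from the question (which isn't shown): the class name SelectedFieldsCsvPipeline, the output path and the field list are all assumptions made for illustration. It relies on csv.DictWriter with extrasaction="ignore", which drops any item key that is not listed in fieldnames.

```python
import csv


class SelectedFieldsCsvPipeline:
    """Hypothetical pipeline: writes only the chosen fields of each item to a CSV file."""

    # Assumed output path and field names; adjust them to your project.
    output_path = "output.csv"
    export_fields = ["Title", "categories"]

    def open_spider(self, spider):
        # newline="" lets the csv module control line endings (matters on Windows).
        self.file = open(self.output_path, "w", newline="", encoding="utf-8")
        # extrasaction="ignore" skips item keys that are not in export_fields.
        self.writer = csv.DictWriter(
            self.file, fieldnames=self.export_fields, extrasaction="ignore"
        )
        self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        # Return the item unchanged so any later pipeline still sees every field.
        return item

    def close_spider(self, spider):
        self.file.close()
```

To try it, you would register the class in ITEM_PIPELINES under whatever module path it lives in (for example 'my_project.pipelines.SelectedFieldsCsvPipeline': 300, a hypothetical path) and drop the FEEDS setting so that only one CSV file is written. Because process_item returns the item untouched, image_urls and image_name remain available to the image pipeline regardless of the order in which the two pipelines run.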