python, scrapy

Scrapy get downloaded file name


I'm new to Scrapy, please bear with me.

I have a spider that visits a page, and downloads a file. Ultimately I want to write the name of the file, along with other useful information to a db table.

Right now, I am struggling to get the file name:

from items.py:

import scrapy
from scrapy.item import Item, Field

class NdrItem(scrapy.Item):
    district = Field()
    file_urls = Field()
    file_name = Field()
    files = Field()

from spider:

import scrapy
from ndr.items import NdrItem

class CentralBedfordshireSpider(scrapy.Spider):
    name = 'central_bedfordshire2'
    allowed_domains = ['centralbedfordshire.gov.uk']
    start_urls = ['http://centralbedfordshire.gov.uk/business/rates/paying/published.aspx']

    def parse(self, response):

        relative_url = response.xpath("//article[@class='page-content__article']/div[@class='editor']/p[3]/a/@href").extract_first()
        download_url = response.urljoin(relative_url)
        item = NdrItem()
        item['district'] = 'central bedfordshire'
        item['file_urls'] = [download_url]
        print('------------------ Print the info I want to eventually go in db --------------------------')
        print(item['district'])
        print(item['files'])
        return item

Edit: The file downloads OK and is saved under its SHA1 hash as the filename. I would like to get hold of that SHA1 filename.

Edit: I get the following error when I run this spider:

2017-08-22 10:39:42 [scrapy.core.scraper] ERROR: Spider error processing <GET http://centralbedfordshire.gov.uk/business/rates/paying/published.aspx> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\MichaelAnderson\GDrive\Python\ndr\ndr\spiders\central_bedfordshire2.py", line 19, in parse
    print(item['files'])
  File "c:\python27\lib\site-packages\scrapy\item.py", line 59, in __getitem__
    return self._values[key]
KeyError: 'files'

More generally, if a number of spiders are all saving files to the same folder, how do people reference the downloaded files and keep them linked to the source URL?

Many thanks for any help


Solution

  • For your specific requirement, I would probably use the Scrapy Files Pipeline together with a custom pipeline ordered after the Files Pipeline. From the Files Pipeline documentation:

    When the files are downloaded, another field (files) will be populated with the results. This field will contain a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped url (taken from the file_urls field), and the file checksum. The files in the list of the files field will retain the same order of the original file_urls field. If some file failed downloading, an error will be logged and the file won’t be present in the files field.

    In your spider, populate the field file_urls with the file locations you wish to download. Then, after the item has been processed by the standard Files Pipeline, it will contain the files field with the SHA1 filename for each of the locations in file_urls, in the same order. Then write another custom pipeline, ordered after the Files Pipeline, which will use this information; a sketch follows below.
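
    A minimal sketch of what that could look like (the pipeline class name FileNamePipeline, the module path ndr.pipelines and the FILES_STORE value are assumptions for illustration, adjust them to your project):

    in settings.py:

    FILES_STORE = 'downloads'  # folder where the Files Pipeline saves downloaded files

    ITEM_PIPELINES = {
        'scrapy.pipelines.files.FilesPipeline': 1,
        # lower numbers run first, so this custom pipeline runs after the Files Pipeline
        'ndr.pipelines.FileNamePipeline': 300,
    }

    in pipelines.py:

    import os

    class FileNamePipeline(object):
        def process_item(self, item, spider):
            # Once the Files Pipeline has run, item['files'] holds one dict per
            # entry in item['file_urls'], e.g.
            # {'url': '...', 'path': 'full/<sha1>.pdf', 'checksum': '...'}
            if item.get('files'):
                item['file_name'] = os.path.basename(item['files'][0]['path'])
                # write item['district'], item['file_name'], etc. to your db table here
            return item

    Note that files is only populated after the item has left the spider and passed through the Files Pipeline, which is also why print(item['files']) inside parse raises the KeyError shown above.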