I'm new to Scrapy, so please bear with me.
I have a spider that visits a page and downloads a file. Ultimately I want to write the name of the file, along with other useful information, to a db table.
Right now, I am struggling to get the file name:
from items.py:
import scrapy
from scrapy.item import Item, Field
class NdrItem(scrapy.Item):
    district = Field()
    file_urls = Field()
    file_name = Field()
    files = Field()
from spider:
import scrapy
from ndr.items import NdrItem
class CentralBedfordshireSpider(scrapy.Spider):
    name = 'central_bedfordshire2'
    allowed_domains = ['centralbedfordshire.gov.uk']
    start_urls = ['http://centralbedfordshire.gov.uk/business/rates/paying/published.aspx']

    def parse(self, response):
        relative_url = response.xpath("//article[@class='page-content__article']/div[@class='editor']/p[3]/a/@href").extract_first()
        download_url = response.urljoin(relative_url)
        item = NdrItem()
        item['district'] = 'central bedfordshire'
        item['file_urls'] = [download_url]
        print('------------------ Print the info I want to eventually go in db --------------------------')
        print(item['district'])
        print(item['files'])
        return item
Edit: The file downloads fine and is saved under its SHA-1 filename. I would like to get hold of that SHA-1 filename.
Edit: I get the following error when I run this spider:
2017-08-22 10:39:42 [scrapy.core.scraper] ERROR: Spider error processing <GET http://centralbedfordshire.gov.uk/business/rates/paying/published.aspx> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\MichaelAnderson\GDrive\Python\ndr\ndr\spiders\central_bedfordshire2.py", line 19, in parse
    print(item['files'])
  File "c:\python27\lib\site-packages\scrapy\item.py", line 59, in __getitem__
    return self._values[key]
KeyError: 'files'
More generally, if a number of spiders are all saving files to the same folder, how do people reference the downloaded files and keep them linked to their source URLs?
Many thanks for any help
For your specific requirement, I would probably use the Scrapy Files Pipeline together with a custom pipeline ordered after the Files Pipeline. From the Files Pipeline documentation:
When the files are downloaded, another field (files) will be populated with the results. This field will contain a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped url (taken from the file_urls field), and the file checksum. The files in the list of the files field will retain the same order of the original file_urls field. If some file failed downloading, an error will be logged and the file won't be present in the files field.
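For example, once the Files Pipeline has run, the files field holds one dict per downloaded file. A rough sketch of what a later pipeline would see (the values below are purely illustrative):

# item['files'] after the Files Pipeline has processed the item (made-up values)
[{
    'url': 'http://centralbedfordshire.gov.uk/.../rates-file.csv',   # taken from file_urls
    'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.csv',     # SHA1-based filename, relative to FILES_STORE
    'checksum': '5d41402abc4b2a76b9719d911017c592',                  # MD5 checksum of the file contents
}]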
In your spider, populate the field file_urls with the file locations you wish to download. Then, after the item has been processed by the standard Files Pipeline, it will contain the field files with the SHA-1 filenames for each of the locations in file_urls, in the same order. Then write another custom pipeline, which will process items after the Files Pipeline and use this information.
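A minimal sketch of how this could be wired up, assuming the project layout from your traceback (the pipeline class name NdrFileNamePipeline is just an example):

# settings.py -- enable the Files Pipeline and order the custom pipeline after it
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'ndr.pipelines.NdrFileNamePipeline': 300,
}
FILES_STORE = 'downloads'   # wherever the downloaded files should be stored

# pipelines.py -- runs after FilesPipeline, so item['files'] is already populated
class NdrFileNamePipeline(object):
    def process_item(self, item, spider):
        if item.get('files'):
            # 'path' is the SHA1-based filename, relative to FILES_STORE
            item['file_name'] = item['files'][0]['path']
        # here you could write item['district'], item['file_name'], etc. to your db table
        return item

Note that this also explains the KeyError in your spider: the files field does not exist inside parse(), because it is only added later by the Files Pipeline, so you would need to drop the print(item['files']) line there.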