I'm running a Scrapy spider in Python to scrape images from a website. After trying some other methods, I'm attempting to use an ImagesPipeline for this.
items.py:
import scrapy

class NHTSAItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
settings.py:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = r'C:\Users\me\Desktop'  # raw string so the backslashes are taken literally
myspider.py:
def parse_photo_page(self, response):
    item = NHTSAItem()
    for sel in response.xpath('//table[@id="tblData"]/tr'):
        url = sel.xpath('td/font/a/@href').extract()
        table_fields = sel.xpath('td/font/text()').extract()
        if url:
            base_url_photo = "http://www-nrd.nhtsa.dot.gov"
            full_url = base_url_photo + url[0]
            if not item:
                item['image_urls'] = [full_url]
            else:
                item['image_urls'].append(full_url)
    return item
No errors come up; the images just don't get downloaded. The log even reports the item as "Scraped". Here's the log:
DEBUG: Scraped from <200 http://www-nrd.nhtsa.dot.gov/database/VSR/veh/../SearchMedia.aspx?database=v&tstno=4000&mediatype=p&p_tstno=4000>
{'image_urls': [u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=1&database=V&type=P',
u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=2&database=V&type=P',
u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=3&database=V&type=P',
u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=4&database=V&type=P',
u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=5&database=V&type=P']}
I don't need to extend the pipeline (write a custom pipeline); the default ImagesPipeline is fine. The images are nowhere to be found. Any ideas what I'm doing wrong?
Here's the solution, which came to me from this parallel question: Scrapy: Error 10054 after retrying image download (Thanks to @neverlastn)
I simply added this snippet to the spider class in my actual spider.py file (custom_settings has to be a class attribute of the spider):
custom_settings = {
    "ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1},
    "IMAGES_STORE": saveLocation
}
I think Scrapy wasn't picking up my settings.py file, so the image pipeline was never enabled. I'm not sure how to get it to reference my settings file correctly, but this solution is good enough for me!
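For anyone who wants to check whether settings.py is being found at all: as far as I know, Scrapy locates the project settings through a scrapy.cfg file (its [settings] section names the settings module under "default"), searching the current directory and its parents. If the spider is started from a folder outside the project, that file is not found and settings.py is not loaded. Below is a rough stdlib-only sketch of that lookup. The function names find_scrapy_cfg and settings_module are my own, not part of Scrapy, and the real lookup may consider more locations (for example the SCRAPY_SETTINGS_MODULE environment variable), so treat it as a diagnostic aid only.

```python
import configparser
from pathlib import Path


def find_scrapy_cfg(start="."):
    """Walk upward from `start` and return the nearest scrapy.cfg, or None.

    Hypothetical helper: approximates how a project root is located.
    """
    current = Path(start).resolve()
    for folder in [current, *current.parents]:
        candidate = folder / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


def settings_module(cfg_path):
    """Return the module named under [settings] default, or None if absent."""
    parser = configparser.ConfigParser()
    parser.read(cfg_path)
    return parser.get("settings", "default", fallback=None)


if __name__ == "__main__":
    cfg = find_scrapy_cfg()
    if cfg is None:
        print("No scrapy.cfg found from here; settings.py is probably not loaded.")
    else:
        print(cfg, "->", settings_module(cfg))
```

Run it from the same directory you start the spider from. If it reports no scrapy.cfg, or a module name that doesn't match your project's settings module, that would explain why ITEM_PIPELINES in settings.py had no effect while custom_settings on the spider did.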