I'm currently working on a student data science project that consists of building a fish recognition system from pictures. We will use TensorFlow to make sense of the data and Scrapy to collect a large amount of data (fish pictures and their scientific names).
I'm new to Scrapy, but I've been working on it a lot for the past 3 days, and I've written a basic FishBase spider (you'll find the URL in the spider's code):
import scrapy

from ..items import FishbaseItem


class FishbaseSpider(scrapy.Spider):
    name = 'fishbase'
    allowed_domains = ['fishbase.org']
    start_urls = [
        'http://fishbase.org/ListByLetter/ScientificNamesQ.htm',
    ]

    def parse(self, response):
        all_fish = response.xpath('//tbody/tr')
        for fish in all_fish:
            taxo = fish.xpath('td/a/i/text()').extract()
            fish_url = fish.xpath('td/a/@href').extract_first()
            item = FishbaseItem()
            item['taxonomy'] = taxo
            r = scrapy.Request(url=response.urljoin(fish_url), callback=self.parseFish)
            r.meta['item'] = item
            yield r

    def parseFish(self, response):
        item = response.meta['item']
        imgUrl = response.xpath('//div/span/div/a/img/@src').extract_first()
        item['img_urls'] = response.urljoin(imgUrl)
        yield item
Here is the items file:
import scrapy


class FishbaseItem(scrapy.Item):
    taxonomy = scrapy.Field()
    fish_url = scrapy.Field()
    img_urls = scrapy.Field()
And the settings file:
BOT_NAME = 'fishbase'
SPIDER_MODULES = ['fishbase.spiders']
NEWSPIDER_MODULE = 'fishbase.spiders'
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'tmp/images/'
ROBOTSTXT_OBEY = True
I'm getting the results I want, but the images won't download, and I don't understand why. Plus, I've already downloaded a bunch of images from other sites.
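There are two problems:
1. The item field must be named image_urls, not img_urls (unless you override the IMAGES_URLS_FIELD setting).
2. The field must contain a list of URLs, but item['img_urls'] = response.urljoin(imgUrl) assigns a single string.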
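To illustrate both points, here is a minimal sketch of the item shape that the images pipeline looks for by default. It is written as a plain helper so it runs without Scrapy: build_fish_item is a hypothetical name, the example.org URLs are placeholders, and urljoin stands in for response.urljoin from the spider.

```python
from urllib.parse import urljoin


def build_fish_item(taxonomy, page_url, img_src):
    """Return a dict shaped the way ImagesPipeline expects by default.

    Hypothetical helper for illustration only; in the spider the same
    logic would live in parseFish, with response.urljoin(img_src).
    """
    item = {
        'taxonomy': taxonomy,
        'image_urls': [],  # default field name, and it must be a list
    }
    # extract_first() returns None when the XPath matches nothing,
    # so only add a URL when an image source was actually found.
    if img_src:
        item['image_urls'].append(urljoin(page_url, img_src))
    return item


if __name__ == '__main__':
    # Placeholder values, not real FishBase paths.
    print(build_fish_item(
        ['Quietula', 'guaymasiae'],
        'http://example.org/summary/fish.html',
        '/images/thumbnails/tn_fish.jpg',
    ))
```

In the project itself, this would presumably mean declaring image_urls = scrapy.Field() in FishbaseItem (instead of img_urls) and assigning item['image_urls'] = [response.urljoin(imgUrl)] in parseFish. Keeping the name img_urls should also work if IMAGES_URLS_FIELD = 'img_urls' is added to the settings file, but the value still has to be a list.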