Tags: python, hyperlink, scrapy, extractor

Scrapy crawl extracted links


I need to crawl a website and follow every URL found under a specific XPath. For example: "http://someurl.com/world/" has 10 links inside the container xpath("//div[@class='pane-content']"), and I need to crawl all 10 of those links and extract images from them. However, the links on "http://someurl.com/world/" look like "http://someurl.com/node/xxxx".

What I have so far:

import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from imgur.items import ImgurItem

class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['someurl.com/']
    start_urls = ['http://someurl.com/news']
    rules = [Rule(LinkExtractor(allow=('/node/.*')), callback='parse_imgur', follow=True)]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath(\
            "//h1[@class='pane-content']/a/text()").extract()
        image['image_urls'] = response.xpath("//img/@src").extract()
        return image

Solution

  • You can rewrite your Rule to cover both requirements, so that it only extracts the /node/... links found inside the pane-content container:

    rules = [Rule(LinkExtractor(allow=('/node/.*',), restrict_xpaths=('//div[@class="pane-content"]',)), callback='parse_imgur', follow=True)]
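
    For reference, here is a minimal sketch of the whole spider with that rule in place. It assumes the same ImgurItem and callback name as in the question, uses the post-1.0 import paths (scrapy.spiders / scrapy.linkextractors; on older versions these live under scrapy.contrib), and joins relative src values into absolute URLs so they can be downloaded later:

    from scrapy.spiders import Rule, CrawlSpider
    from scrapy.linkextractors import LinkExtractor
    from imgur.items import ImgurItem

    class ImgurSpider(CrawlSpider):
        name = 'imgur'
        allowed_domains = ['someurl.com']  # domain only, no trailing slash
        start_urls = ['http://someurl.com/world/']

        # Follow only the /node/... links that sit inside the pane-content container
        rules = [Rule(LinkExtractor(allow=('/node/.*',),
                                    restrict_xpaths=('//div[@class="pane-content"]',)),
                      callback='parse_imgur', follow=True)]

        def parse_imgur(self, response):
            image = ImgurItem()
            image['title'] = response.xpath(
                "//h1[@class='pane-content']/a/text()").extract()
            # urljoin turns relative src attributes into absolute URLs
            image['image_urls'] = [response.urljoin(src)
                                   for src in response.xpath("//img/@src").extract()]
            return image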
    

    To download images from the extracted image links you can make use of Scrapy's bundled ImagesPipeline.
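
    A minimal sketch of that setup follows (the store path is just a placeholder). The pipeline expects the item to declare image_urls and images fields, and both the pipeline and a storage directory are enabled in settings.py:

    # items.py -- ImagesPipeline expects these two field names by default
    import scrapy

    class ImgurItem(scrapy.Item):
        title = scrapy.Field()
        image_urls = scrapy.Field()  # input: list of absolute image URLs
        images = scrapy.Field()      # output: filled in by the pipeline after download

    # settings.py
    ITEM_PIPELINES = {
        'scrapy.pipelines.images.ImagesPipeline': 1,
        # on pre-1.0 Scrapy: 'scrapy.contrib.pipeline.images.ImagesPipeline': 1,
    }
    IMAGES_STORE = '/path/to/store/images'  # placeholder; point it at a real directory

    With this in place, any item returned with image_urls populated has its images downloaded automatically, and the download results are recorded in the item's images field.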