I have the following minimal spider that I use to get all links from a website.
from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class SampleItem(Item):
    link = Field()


class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
I was wondering whether it would be possible to have this same spider also scrape some HTML (like the snippet below) from those same links and list the link and the scraped info in a CSV in two separate columns?

<span class="price">50,00 €</span>
Yes, that's possible, of course. First of all, you need to use a feed export. This can be set in settings.py with these options:

FEED_FORMAT = 'csv'
FEED_URI = 'file:///absolute/path/to/the/output.csv'
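Alternatively, if you only need a quick one-off export, you can skip these settings and pass the output file on the command line when running the spider; Scrapy infers the feed format from the file extension:

scrapy crawl sample_spider -o output.csv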
Then you will have to adjust your items to allow more elements. Currently you only use the link; you will want to add a price field.
class SampleItem(Item):
    link = Field()
    price = Field()
One side note: usually we define items in the items.py file, because generally multiple spiders should scrape the same type of item from several pages. You would then import them into your spider using from scrapername.items import SampleItem. An example application for this would be a price scraper which scrapes both Amazon and some smaller shops.
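As a rough sketch of that layout (assuming your project is called scrapername, matching the import above, and that the spider lives in the standard spiders/ package):

# scrapername/items.py
from scrapy.item import Field, Item

class SampleItem(Item):
    link = Field()
    price = Field()

# scrapername/spiders/sample_spider.py would then start with
from scrapername.items import SampleItem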
Finally, you will have to adjust the parse_page method of your spider. Currently you only save the URL into your item; you want to find the price and save it as well. Finding numbers or text on a page is a key element of scraping, and for this purpose we have selectors. Scrapy supports XPath, CSS and regular expression selectors. The first two are especially useful because they can be nested. Regular expressions are generally used when you have found the correct HTML element but there is too much information within that one element.
A problem you might encounter is that a page might have multiple .price elements. Have you made sure there is only one? Otherwise the selector will give you all of them, and you might have to refine it with additional surrounding tags, as sketched below.
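You can check this interactively with the Scrapy shell before touching the spider; the div.product-detail wrapper below is just a made-up example of how you might narrow the selection down:

scrapy shell "http://www.example.com/"
>>> response.css('.price::text').extract()                      # every match on the page
>>> response.css('div.product-detail .price::text').extract()   # hypothetical wrapper to narrow it down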
So, let's assume there is only this one .price element and construct our selector. We use a CSS selector here because it's more intuitive in this case. You can call the selectors directly on the response using its css and xpath methods. Both of them always return elements, on which you can use css() and xpath() again. To get the textual representation you need to call extract() on them. This might be annoying at the beginning, but nesting selectors is very convenient. Note that the selectors give you the full HTML element including the tag; to get only the text content, you need to make this explicit: for CSS selectors via ::text, for XPath via /text().
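As an illustration, for the snippet from the question (a <span class="price"> element) these two give the same result, whether in the shell session above or inside the spider:

response.css('span.price::text').extract()
response.xpath('//span[@class="price"]/text()').extract()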
def parse_page(self, response):
    item = SampleItem()
    item['link'] = response.url
    try:
        item['price'] = response.css('.price::text')[0].extract()
    except IndexError:
        # do whatever is best if the price cannot be found
        item['price'] = None
    return item
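As a side note, newer versions of Scrapy also offer extract_first(), which returns None (or a default you pass) when nothing matches, so the try/except can be dropped; a minimal sketch, assuming such a version:

def parse_page(self, response):
    item = SampleItem()
    item['link'] = response.url
    # extract_first() returns None instead of raising IndexError
    # when no .price element is found
    item['price'] = response.css('.price::text').extract_first()
    return item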