Tags: python, web-crawler, scrapy, sgml

SgmlLinkExtractor not displaying results or following link


I am having trouble fully understanding how the SGML link extractor works. When building a crawler with Scrapy, I can successfully extract data from links at specific URLs. The problem is using a Rule to follow the next-page link on a particular URL.

I think the problem lies in the allow() attribute. When the Rule is added to the code, the results do not display in the command line and the link to the next page is not followed.

Any help is greatly appreciated.

Here is the code...

import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.spiders import Rule

from tutorial.items import TutorialItem

class AllGigsSpider(CrawlSpider):
    name = "allGigs"
    allowed_domains = ["http://www.allgigs.co.uk/"]
    start_urls = [
        "http://www.allgigs.co.uk/whats_on/London/clubbing-1.html",
        "http://www.allgigs.co.uk/whats_on/London/festivals-1.html",
        "http://www.allgigs.co.uk/whats_on/London/comedy-1.html",
        "http://www.allgigs.co.uk/whats_on/London/theatre_and_opera-1.html",
        "http://www.allgigs.co.uk/whats_on/London/dance_and_ballet-1.html"
    ]    
    rules = (Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//div[@class="more"]',)), callback="parse_me", follow= True),
    )

    def parse_me(self, response):
        hxs = HtmlXPathSelector(response)
        infos = hxs.xpath('//div[@class="entry vevent"]')
        items = []
        for info in infos:
            item = TutorialItem()
            item ['artist'] = hxs.xpath('//span[@class="summary"]//text()').extract()
            item ['date'] = hxs.xpath('//abbr[@class="dtstart dtend"]//text()').extract()
            item ['endDate'] = hxs.xpath('//abbr[@class="dtend"]//text()').extract()            
            item ['startDate'] = hxs.xpath('//abbr[@class="dtstart"]//text()').extract()
            items.append(item)
        return items
        print items

Solution

  • The problem is with restrict_xpaths: it should point to the block where the link extractor should look for links. Don't specify allow at all:

    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'), 
             callback="parse_me", 
             follow=True),
    ]
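    To see what restrict_xpaths is doing conceptually, here is a rough pure-Python sketch (using only the standard library, not Scrapy's actual extractor code) of collecting only the `<a href>` links that sit inside a `<div class="more">` block:

    ```python
    from html.parser import HTMLParser

    class MoreLinkCollector(HTMLParser):
        """Sketch of the restrict_xpaths idea: only collect <a href> links
        that appear inside a <div class="more"> block."""
        def __init__(self):
            super().__init__()
            self.depth = 0   # nesting depth inside the target div (0 = outside)
            self.links = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "div":
                if self.depth or attrs.get("class") == "more":
                    self.depth += 1
            elif tag == "a" and self.depth and "href" in attrs:
                self.links.append(attrs["href"])

        def handle_endtag(self, tag):
            if tag == "div" and self.depth:
                self.depth -= 1

    html = '''
    <div class="listing"><a href="/event-1.html">Event</a></div>
    <div class="more"><a href="/clubbing-2.html">Next page</a></div>
    '''
    parser = MoreLinkCollector()
    parser.feed(html)
    print(parser.links)  # ['/clubbing-2.html']
    ```

    The link in the listing div is ignored; only the next-page link inside the restricted block is picked up, which is exactly the behaviour you want from the Rule.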
    

    And you need to fix your allowed_domains:

    allowed_domains = ["www.allgigs.co.uk"]
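    The reason the scheme breaks things: offsite filtering compares each request's hostname against the entries in allowed_domains, so "http://www.allgigs.co.uk/" never matches the hostname "www.allgigs.co.uk" and every followed link gets filtered out. A simplified sketch of that check (an illustration, not Scrapy's actual OffsiteMiddleware code):

    ```python
    from urllib.parse import urlparse

    def is_allowed(url, allowed_domains):
        # Simplified offsite check: the URL's hostname must equal an
        # allowed domain, or be a subdomain of one.
        host = urlparse(url).netloc
        return any(host == d or host.endswith("." + d) for d in allowed_domains)

    url = "http://www.allgigs.co.uk/whats_on/London/clubbing-2.html"
    print(is_allowed(url, ["http://www.allgigs.co.uk/"]))  # False: scheme never matches
    print(is_allowed(url, ["www.allgigs.co.uk"]))          # True
    ```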
    

    Also note that the print items in the parse_me() callback is unreachable, since it comes after the return statement. And, inside the loop, you should not apply the XPath expressions through hxs; they should be applied in the context of each info selector (note the leading . in the expressions below). With that, parse_me() simplifies to:

    def parse_me(self, response):
        for info in response.xpath('//div[@class="entry vevent"]'):
            item = TutorialItem()
            item['artist'] = info.xpath('.//span[@class="summary"]//text()').extract()
            item['date'] = info.xpath('.//abbr[@class="dtstart dtend"]//text()').extract()
            item['endDate'] = info.xpath('.//abbr[@class="dtend"]//text()').extract()            
            item['startDate'] = info.xpath('.//abbr[@class="dtstart"]//text()').extract()
            yield item
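    As a side note on the return/yield change, here is a minimal pure-Python illustration of why the original print items never runs, and how a generator lets the callback hand items back one at a time (which is the usual pattern in Scrapy callbacks):

    ```python
    def with_return():
        items = [1, 2]
        return items
        print(items)  # unreachable: the function has already returned

    def with_yield(values):
        # A generator: each item is handed to the caller as soon as it
        # is produced, instead of building the whole list first.
        for v in values:
            yield v

    print(with_return())              # [1, 2] -- the print inside never fires
    print(list(with_yield([1, 2])))   # [1, 2]
    ```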