
Scrapy forum crawler starting but not returning any scraped data


Here is my code. Can someone please help? For some reason the spider runs but does not actually crawl the forum threads. I am trying to extract all the text from the threads in the specific forum given in my start URL.

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.conf import settings

from xbox.items import xboxItem


class xboxSpider(CrawlSpider):
    name = "xbox"
    allowed_domains = ["forums.xbox.com"]
    start_urls= [
    "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/default.aspx",
    ]
    rules= [
        Rule(SgmlLinkExtractor(allow=['/t/\d+']),callback='parse_thread'),  
        Rule(SgmlLinkExtractor(allow=('/t/new\?new_start=\d+',)))
            ]


    def parse_thread(self, response):
        hxs=HtmlXPathSelector(response)

        item=xboxItem()
        item['content']=hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
        item['date']=hxs.select("//span[@class='value']/text()").extract()
        return item

Log output:

2013-03-13 11:22:18-0400 [scrapy] DEBUG: Enabled item pipelines: 
2013-03-13 11:22:18-0400 [xbox] INFO: Spider opened 
2013-03-13 11:22:18-0400 [xbox] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-13 11:22:20-0400 [xbox] DEBUG: Crawled (200) <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…; (referer: None) 
2013-03-13 11:22:20-0400 [xbox] DEBUG: Filtered offsite request to 'forums.xbox.com': <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…; 
2013-03-13 11:22:20-0400 [xbox] INFO: Closing spider (finished) 
2013-03-13 11:22:20-0400 [xbox] INFO: Dumping spider stats

Solution

  • As a first tweak, modify your first rule by putting a "." at the start of the regex, as follows. I also changed the start URL to the actual first page of the forum.

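    start_urls = [
        "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310.aspx",
    ]
    rules = (
        # Thread pages: scrape them and keep following links found on them.
        Rule(SgmlLinkExtractor(allow=(r'./t/\d+',)), callback="parse_thread", follow=True),
        # Forum index pagination: "?" is escaped so it matches the literal query-string marker.
        Rule(SgmlLinkExtractor(allow=(r'./310\.aspx\?PageIndex=\d+',))),
    )
    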
    

    I've updated the rules so that the spider now crawls all of the pages in the thread.
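
A quick way to check an allow pattern before running the spider is to try it against a sample URL with Python's re module. This is only a sketch: it assumes the link extractor applies each pattern as a search over the absolute URL, and the thread URL below is a hypothetical example, not one taken from the forum. Keep in mind that an unescaped "?" is a regex quantifier, so it has to be written as "\?" to match the literal question mark in a query string.

```python
import re

# Hypothetical sample URLs; the real forum's link format may differ.
THREAD_URL = "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310/t/12345.aspx"
PAGE_URL = "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310.aspx?PageIndex=2"


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


# Thread pattern: any character, then "/t/" followed by digits.
thread_ok = matches(r"./t/\d+", THREAD_URL)

# Unescaped "?" makes the preceding "x" optional instead of matching "?".
page_unescaped = matches(r"./310.aspx?PageIndex=\d+", PAGE_URL)

# Escaped "\?" matches the literal question mark of the query string.
page_escaped = matches(r"./310\.aspx\?PageIndex=\d+", PAGE_URL)

print(thread_ok, page_unescaped, page_escaped)
```

If a pattern does not match the URLs you expect here, the corresponding rule will not pick up those links in the crawl either.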

    EDIT: I've found a typo that may be causing an issue, and I've also fixed the date XPath.

     item['content']=hxs.select("//div[@class='post-content user-defined-markup']/p/text()").extract()
     item['date']=hxs.select("(//div[@class='post-author'])[1]//a[@class='internal-link view-post']/text()").extract()
    

    The content line in your code says "hxs.selec" and should be "hxs.select". I changed that and could then see content being scraped. Through trial and error (I'm a bit rubbish with XPaths), I've managed to get the date of the first post (i.e. the date the thread was created), so this should all work now.
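
To illustrate why the date expression indexes the first "post-author" block, here is a small sketch that runs outside Scrapy. It uses the standard library's ElementTree, which supports only a subset of XPath, so the "(...)[1]" step is approximated with find(), which returns the first match. The HTML fragment and the dates in it are made up for the example; only the class names come from the code above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one "post-author" block per post.
SAMPLE = """
<html><body>
  <div class="post-author"><a class="internal-link view-post">03-01-2013 9:15 AM</a></div>
  <div class="post-author"><a class="internal-link view-post">03-02-2013 4:40 PM</a></div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# find() returns the first matching element, i.e. the first post's author block.
first_author = root.find(".//div[@class='post-author']")
first_date = first_author.find(".//a[@class='internal-link view-post']").text

# Without restricting to the first block, every post's date is returned.
all_dates = [a.text for a in root.findall(".//a[@class='internal-link view-post']")]

print(first_date)
print(all_dates)
```

Restricting the search to the first block yields a single date for the thread, while the unrestricted search returns one date per post.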