Here is my code. Can someone please help? The spider runs but does not actually crawl the forum threads. I am trying to extract all the text from the threads of the specific forum in my start URL.
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from xbox.items import xboxItem
from scrapy.item import Item
from scrapy.conf import settings

class xboxSpider(CrawlSpider):
    name = "xbox"
    allowed_domains = ["forums.xbox.com"]
    start_urls = [
        "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/default.aspx",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['/t/\d+']), callback='parse_thread'),
        Rule(SgmlLinkExtractor(allow=('/t/new\?new_start=\d+',)))
    ]

    def parse_thread(self, response):
        hxs = HtmlXPathSelector(response)
        item = xboxItem()
        item['content'] = hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
        item['date'] = hxs.select("//span[@class='value']/text()").extract()
        return item
Log output:
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Enabled item pipelines:
2013-03-13 11:22:18-0400 [xbox] INFO: Spider opened
2013-03-13 11:22:18-0400 [xbox] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-13 11:22:20-0400 [xbox] DEBUG: Crawled (200) <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…; (referer: None)
2013-03-13 11:22:20-0400 [xbox] DEBUG: Filtered offsite request to 'forums.xbox.com': <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…;
2013-03-13 11:22:20-0400 [xbox] INFO: Closing spider (finished)
2013-03-13 11:22:20-0400 [xbox] INFO: Dumping spider stats
As a first tweak, modify your first rule by putting a "." at the start of the regex, as follows. I also changed the start URL to the actual first page of the forum.
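start_urls = [
    "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310.aspx",
]
rules = (
    # Thread links: scrape each thread and keep following links found on it.
    Rule(SgmlLinkExtractor(allow=(r'./t/\d+',)), callback="parse_thread", follow=True),
    # Forum pagination: "." and "?" are escaped so they match literally.
    Rule(SgmlLinkExtractor(allow=(r'./310\.aspx\?PageIndex=\d+',))),
)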
I've updated the rules so that the spider now crawls all of the pages in the thread.
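If you want to check which links a rule will pick up without running the spider, here is a rough sketch using only the standard `re` module. As far as I know, the link extractor applies each `allow` pattern as a regex search against the absolute URL, so `re.search` should be a fair stand-in. The URLs and the `classify` helper are made up for illustration and are not taken from the real site. Note that the pagination pattern escapes "?", because an unescaped "?" is a regex quantifier and will not match a literal question mark.

```python
import re

# Hypothetical URLs shaped like the forum's links (not copied from the site).
urls = [
    "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/t/12345.aspx",
    "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310.aspx?PageIndex=2",
    "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310.aspx",
]

thread_pattern = re.compile(r"./t/\d+")
page_pattern = re.compile(r"./310\.aspx\?PageIndex=\d+")
# Same pagination pattern with "?" left unescaped, for comparison.
unescaped_page_pattern = re.compile(r"./310.aspx?PageIndex=\d+")


def classify(url):
    """Hypothetical helper: report which rule, if any, would accept the URL."""
    if thread_pattern.search(url):
        return "thread"
    if page_pattern.search(url):
        return "page"
    return "ignored"


for url in urls:
    print(classify(url), url)

# The unescaped variant does not match the paginated URL.
print(unescaped_page_pattern.search(urls[1]))
```

The last line prints `None`, which is why the "?" needs a backslash in front of it.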
EDIT: I've found a typo that may be causing an issue, and I've fixed the date xpath.
item['content']=hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
item['date']=hxs.select("(//div[@class='post-author'])[1]//a[@class='internal-link view-post']/text()").extract()
The first line above says "hxs.selec" and should be "hxs.select". After changing that, I could see content being scraped. Through trial and error (I'm a bit rubbish with XPaths), I've managed to get the date of the first post (i.e. the date the thread was created), so this should all work now.
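To show why the date expression only returns the first post's date, here is a small sketch on a made-up page. Everything in `SAMPLE` (the markup layout, the dates, the text) is an assumption for illustration, since I have not copied the real forum HTML. It uses the standard library's `xml.etree.ElementTree`, which does not support the parenthesised `(...)[1]` form, so `find()` stands in for it: it returns the first match in document order, which is the same idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a thread page; the real markup may differ.
SAMPLE = """
<html><body>
  <div class="post-author"><a class="internal-link view-post">03-01-2013 10:15 AM</a></div>
  <div class="post-content user-defined-markup"><p>First post text</p></div>
  <div class="post-author"><a class="internal-link view-post">03-02-2013 08:40 PM</a></div>
  <div class="post-content user-defined-markup"><p>Reply text</p></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Without restricting to the first author block, every post's date comes back.
all_dates = [
    author.find(".//a[@class='internal-link view-post']").text
    for author in root.findall(".//div[@class='post-author']")
]

# find() returns only the first matching element in document order,
# mirroring (//div[@class='post-author'])[1] in full XPath.
first_author = root.find(".//div[@class='post-author']")
first_date = first_author.find(".//a[@class='internal-link view-post']").text

# Paragraph text of every post, like the 'content' field in the spider.
content = [
    p.text
    for p in root.findall(".//div[@class='post-content user-defined-markup']/p")
]

print(all_dates)
print(first_date)
print(content)
```

In the spider itself the full XPath from the answer does this in one expression. The two-step `find()` here is only to make the "first match" behaviour visible.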