I'm working on a project to scrape statistics from Fantasy Football leagues across various services, and Yahoo is the one I'm currently stuck on. I want my spider to crawl the Draft Results page of a public Yahoo league. When I run the spider, it gives me no results and no error message either. It simply says:
2012-09-14 17:29:08-0700 [draft] DEBUG: Crawled (200) <GET http://football.fantasysports.yahoo.com/f1/753697/draftresults?drafttab=round> (referer: None)
2012-09-14 17:29:08-0700 [draft] INFO: Closing spider (finished)
2012-09-14 17:29:08-0700 [draft] INFO: Dumping spider stats:
{'downloader/request_bytes': 250,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 48785,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 9, 15, 0, 29, 8, 734000),
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 9, 15, 0, 29, 7, 718000)}
2012-09-14 17:29:08-0700 [draft] INFO: Spider closed (finished)
2012-09-14 17:29:08-0700 [scrapy] INFO: Dumping global stats:
{}
It's not a login issue, because the page in question is accessible without being signed in. I see from other questions posted here that people have gotten scrapes to work for other parts of Yahoo. Is it possible that Yahoo Fantasy is blocking spiders? I've successfully written one for ESPN already, so I don't think the issue is with my code. Here it is anyway:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
# DraftItem comes from the project's own items definition (import omitted in the original post)

class DraftSpider(CrawlSpider):
    name = "draft"
    #psycopg stuff here
    rows = ["753697"]
    allowed_domains = ["football.fantasysports.yahoo.com"]
    start_urls = []
    for row in rows:
        start_urls.append("http://football.fantasysports.yahoo.com/f1/" + "%s" % (row) + "/draftresults?drafttab=round")

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # One <tr> per draft pick, located by an absolute path from the document root
        sites = hxs.select("/html/body/div/div/div/div/div/div/div/table/tr")
        items = []
        for site in sites:
            item = DraftItem()
            item['pick_number'] = site.select("td[@class='first']/text()").extract()
            item['pick_player'] = site.select("td[@class='player']/a/text()").extract()
            item['pick_nflteam'] = site.select("td[@class='player']/span/text()").extract()
            item['pick_ffteam'] = site.select("td[@class='last']/@title").extract()
            items.append(item)
        return items
Would really appreciate any insight on this.
C:\Users\Akhter Wahab>scrapy shell http://football.fantasysports.yahoo.com/f1/75
In [1]: hxs.select("/html/body/div/div/div/div/div/div/div/table/tr")
Out[1]: []
Your absolute XPath, "/html/body/div/div/div/div/div/div/div/table/tr", is not right: as the shell session above shows, it matches nothing on the page.
In any case, I would never recommend using an absolute XPath. Use a relative XPath instead. All of the results are inside this div:
//div[@id='drafttables']
so you can start extracting the results from there.
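
To make the difference concrete, here is a minimal sketch that runs without Scrapy. It uses the standard library's ElementTree (which supports only a subset of XPath) as a stand-in for HtmlXPathSelector. The SAMPLE markup, the player names, and the team names are made up for illustration; only the drafttables id and the td class names come from the posts above. The real page is nested differently, so treat this as a demonstration of the idea rather than of Yahoo's actual HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. Only id="drafttables" and the td class
# names are taken from the question and answer; everything else is invented.
SAMPLE = """
<html><body>
  <div id="wrapper">
    <div id="drafttables">
      <table>
        <tr>
          <td class="first">1</td>
          <td class="player"><a>Player A</a><span>(Team X - RB)</span></td>
          <td class="last" title="FF Team 1">FF Team 1</td>
        </tr>
        <tr>
          <td class="first">2</td>
          <td class="player"><a>Player B</a><span>(Team Y - QB)</span></td>
          <td class="last" title="FF Team 2">FF Team 2</td>
        </tr>
      </table>
    </div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Absolute path with a guessed nesting depth: it breaks as soon as the
# number of wrapper divs differs from the guess.
absolute_rows = root.findall("./body/div/div/div/div/div/div/div/table/tr")

# Relative path anchored on the id: it does not care how deep the div sits.
relative_rows = root.findall(".//div[@id='drafttables']//tr")

picks = []
for row in relative_rows:
    picks.append({
        "pick_number": row.findtext("td[@class='first']"),
        "pick_player": row.findtext("td[@class='player']/a"),
        "pick_nflteam": row.findtext("td[@class='player']/span"),
        "pick_ffteam": row.find("td[@class='last']").get("title"),
    })

print(len(absolute_rows), len(relative_rows))
for pick in picks:
    print(pick)
```

In the spider itself, the equivalent change would presumably be to replace the absolute path passed to hxs.select with something like "//div[@id='drafttables']//tr". Whether the rows sit directly under the tables in that div is an assumption, so check it in scrapy shell first.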