python, screen-scraping, yahoo, scrapy

Problem using Scrapy to scrape a Yahoo Group


I'm new to web scraping and just started experimenting with Scrapy, a scraping framework written in Python. My goal is to scrape an old Yahoo Group, since Yahoo doesn't provide an API or any other means to retrieve message archives. The group is configured so that you have to log in before you can view the archives.

The steps I need to accomplish, I think, are:

  1. Log in to Yahoo
  2. Visit the URL for the first message and scrape it
  3. Repeat step 2 for the next message, and so on
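
As a rough sketch of steps 2 and 3, the message URLs could be generated up front from a range of message ids. The MSG_URL pattern here is an assumption based on the URLs in the debug output further down, and message_urls is a hypothetical helper, not part of Scrapy:

```python
# Assumed URL pattern, inferred from the debug output in this post.
MSG_URL = "http://launch.groups.yahoo.com/group/MyYahooGroup/message/%d"


def message_urls(first_id, last_id):
    """Yield the archive URL for every message id in [first_id, last_id]."""
    for msg_id in range(first_id, last_id + 1):
        yield MSG_URL % msg_id


if __name__ == "__main__":
    # Print the first three URLs the spider would need to visit.
    for url in message_urls(1, 3):
        print(url)
```

Because the ids are consecutive integers, the spider never has to discover links; it only has to walk the range.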

I started roughing out a Scrapy spider to accomplish the above, and here is what I have so far. For now, all I want to confirm is that the login works and that I can retrieve the first message. I'll finish the rest once I get this much working:

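from scrapy.spider import BaseSpider
from scrapy.http import Request, FormRequest

# LOGIN_URL, MSG_URL, LOGIN and PASSWORD are module-level constants (not shown).

class Sg101Spider(BaseSpider):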
    name = "sg101"
    msg_id = 1              # current message to retrieve
    max_msg_id = 21399      # last message to retrieve

    def start_requests(self):
        return [FormRequest(LOGIN_URL,
            formdata={'login': LOGIN, 'passwd': PASSWORD},
            callback=self.logged_in)]

    def logged_in(self, response):
        if response.url == 'http://my.yahoo.com':
            self.log("Successfully logged in. Now requesting 1st message.")
            return Request(MSG_URL % self.msg_id, callback=self.parse_msg,
                    errback=self.error)
        else:
            self.log("Login failed.")

    def parse_msg(self, response):
        self.log("Got message!")
        print(response.body)

    def error(self, failure):
        self.log("I haz an error")

When I run the spider, I see it log in and issue the request for the first message. However, all I see in the debug output from Scrapy is 3 redirects, eventually arriving at the URL I asked for in the first place. Scrapy never calls my parse_msg() callback, and the crawl stops. Here is a snippet of the Scrapy output:

2011-02-03 19:50:10-0600 [sg101] INFO: Spider opened
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (302) to <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com> from <POST https://login.yahoo.com/config/login>
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (meta refresh) to <GET http://my.yahoo.com> from <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Crawled (200) <GET http://my.yahoo.com> (referer: None)
2011-02-03 19:50:12-0600 [sg101] DEBUG: Successfully logged in. Now requesting 1st message.
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] INFO: Closing spider (finished)
2011-02-03 19:50:13-0600 [sg101] INFO: Spider closed (finished)

I am unable to make sense of this. It looks like Yahoo is redirecting the spider (maybe for an auth check?), and the redirects seem to arrive back at the URL I wanted to visit in the first place. But Scrapy doesn't call my callback, so I don't get a chance to scrape the data or continue crawling.
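
One way to check that reading of the log is to extract the redirect pairs and see whether a redirect target matches a URL that was already requested. The sketch below is only an illustration: redirect_chain and loops_back are hypothetical helpers, and the regular expression assumes the "Redirecting (...) to <GET ...> from <GET ...>" line format shown above.

```python
import re

# Matches Scrapy debug lines of the form shown in the output above.
REDIRECT_RE = re.compile(
    r"Redirecting \([^)]+\) to <GET (?P<to>\S+)> from <(?:GET|POST) (?P<src>\S+)>"
)


def redirect_chain(log_text):
    """Return (source_url, target_url) pairs in the order they were logged."""
    return [(m.group("src"), m.group("to")) for m in REDIRECT_RE.finditer(log_text)]


def loops_back(chain):
    """True if any redirect target is a URL that was already requested."""
    seen = set()
    for src, target in chain:
        seen.add(src)
        if target in seen:
            return True
    return False


# The three group redirects from the output above, rebuilt from their URLs.
BASE = "http://launch.groups.yahoo.com/group/MyYahooGroup"
MSG = BASE + "/message/1"
AUTH1 = BASE + "/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1"
AUTH2 = BASE + "/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1"
LINE = "[sg101] DEBUG: Redirecting (302) to <GET %s> from <GET %s>"
SAMPLE_LOG = "\n".join([LINE % (AUTH1, MSG), LINE % (AUTH2, AUTH1), LINE % (MSG, AUTH2)])

if __name__ == "__main__":
    print(loops_back(redirect_chain(SAMPLE_LOG)))
```

For these three lines the last redirect target is the same URL as the original request, which is the situation described above.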

Does anyone have any ideas on what is happening and/or how to debug this further? Thanks!


Solution

  • I think Yahoo is redirecting for an authorization check and then redirecting back to the page I originally requested. Scrapy has already seen a request for that URL, however, so it filters out the final redirect as a duplicate to avoid getting into a loop, and the callback is never called. The solution, in my case, was to add dont_filter=True to the Request constructor, which tells Scrapy not to filter out duplicate requests. That is fine here because I know in advance which URLs I want to crawl.

    # Assumes `from scrapy import log` at module level for log.INFO / log.CRITICAL.
    def logged_in(self, response):
        if response.url == 'http://my.yahoo.com':
            self.log("Successfully logged in. Now requesting message page.",
                    level=log.INFO)
            return Request(MSG_URL % self.msg_id, callback=self.parse_msg,
                    errback=self.error, dont_filter=True)
        else:
            self.log("Login failed.", level=log.CRITICAL)
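
To illustrate why the last redirect is dropped, here is a toy model of a duplicate filter. It is a simplified sketch, not Scrapy's implementation: SeenFilter and follow are hypothetical names, the filter is keyed on the plain URL (Scrapy fingerprints requests instead), and it assumes the dont_filter flag carries over to the redirected requests.

```python
class SeenFilter:
    """Toy duplicate filter that remembers every URL it has let through."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        # With dont_filter set, the filter is bypassed entirely.
        if dont_filter:
            return True
        if url in self.seen:
            return False
        self.seen.add(url)
        return True


def follow(urls, dont_filter=False):
    """Walk the initial URL plus its redirect targets; return those actually fetched."""
    dupefilter = SeenFilter()
    fetched = []
    for url in urls:
        if not dupefilter.should_schedule(url, dont_filter=dont_filter):
            break  # dropped as a duplicate, so the crawl ends here
        fetched.append(url)
    return fetched


# Original request followed by the three redirect targets from the log.
BASE = "http://launch.groups.yahoo.com/group/MyYahooGroup"
MSG = BASE + "/message/1"
CHAIN = [MSG, BASE + "/auth?done=...", BASE + "/auth?check=G&done=...", MSG]

if __name__ == "__main__":
    print(len(follow(CHAIN)))                    # filtered crawl
    print(len(follow(CHAIN, dont_filter=True)))  # unfiltered crawl
```

In this model the filtered crawl stops after the second auth URL, because the final hop repeats the first URL. With dont_filter=True the final hop is fetched, which is the point where a callback could run.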