Tags: python, python-2.7, scrapy, control-flow

Scrapy instance method mysteriously refusing to call another instance method


I'm using Scrapy to scrape a site that has a login page followed by a set of content pages with sequential integer IDs, pulled up as a URL parameter. This had been running successfully for a while, but the other day I decided to move the code that yields the Requests into a separate method, so that I can call it from other places besides the initial load (basically, to dynamically add some more pages to fetch).

And it... just won't call that separate method. It reaches the point where I invoke self.issue_requests(), and proceeds right through it as if the instruction isn't there.

So this (part of the spider class definition, without the separate method) works:

        # ...final bit of start_requests():
        yield scrapy.FormRequest(url=LOGIN_URL + '/login', method='POST',
                                 formdata=LOGIN_PARAMS, callback=self.parse_login)

    def parse_login(self, response):
        self.logger.debug("Logged in successfully!")
        global next_priority, input_reqno, REQUEST_RANGE, badreqs

        # go through our desired page numbers
        while len(REQUEST_RANGE) > 0:
            input_reqno = int(REQUEST_RANGE.pop(0))

            if input_reqno not in badreqs:
                yield scrapy.Request(url=REQUEST_BASE_URL + str(input_reqno), method='GET',
                                     meta={'input_reqno': input_reqno, 'dont_retry': True},
                                     callback=self.parse_page, priority=next_priority)
                next_priority -= 1

    def parse_page(self, response):
        # ...

...however, this slight refactor does not:

        # ...final bit of start_requests():
        yield scrapy.FormRequest(url=LOGIN_URL + '/login', method='POST',
                                 formdata=LOGIN_PARAMS, callback=self.parse_login)

    def issue_requests(self):
        self.logger.debug("Inside issue_requests()!")
        global next_priority, input_reqno, REQUEST_RANGE, badreqs

        # go through our desired page numbers
        while len(REQUEST_RANGE) > 0:
            input_reqno = int(REQUEST_RANGE.pop(0))

            if input_reqno not in badreqs:
                yield scrapy.Request(url=REQUEST_BASE_URL + str(input_reqno), method='GET',
                                     meta={'input_reqno': input_reqno, 'dont_retry': True},
                                     callback=self.parse_page, priority=next_priority)
                next_priority -= 1
        return

    def parse_login(self, response):
        self.logger.debug("Logged in successfully!")
        self.issue_requests()

    def parse_page(self, response):
        # ...

Looking at the logs, it reaches the "Logged in successfully!" part, but it never gets to "Inside issue_requests()!", and because no Scrapy Request objects are yielded by the generator, its next step is to close the spider, having done nothing.

I've never seen a situation where an object instance just refuses to call a method. You'd expect some failure message if it can't pass control to the method, or if there's (say) a problem with the variable scoping inside it. But for it to silently move on and pretend I never told it to call issue_requests() is, to me, bizarre. Help!

(this is Python 2.7.18, btw)


Solution

  • You have to yield items from parse_login as well:

    def parse_login(self, response):
        self.logger.debug("Logged in successfully!")
        for req in self.issue_requests():
            yield req
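
The reason the original version appears to "skip" the call is that issue_requests() contains yield statements, which makes it a generator function. Calling it does do something: it builds a generator object. But the body of the method never runs until that generator is iterated, and parse_login was discarding it immediately, which is also why the "Inside issue_requests()!" debug line never appears. Scrapy only schedules requests that its callbacks actually yield, so parse_login has to iterate the generator and re-yield each request. Here is a minimal standalone sketch of the same behaviour in plain Python (no Scrapy; the function names are just placeholders, not the spider's real code):

    def issue_requests():
        # a generator function: its body runs only when someone iterates it
        print("Inside issue_requests()!")
        for i in range(3):
            yield i

    def parse_login_broken():
        issue_requests()               # builds a generator, then throws it away

    def parse_login_fixed():
        for item in issue_requests():  # iterating finally runs the body
            yield item

    parse_login_broken()               # prints nothing at all
    print(list(parse_login_fixed()))   # prints "Inside issue_requests()!" then [0, 1, 2]

On Python 3.3+ the loop could be shortened to yield from self.issue_requests(), but that syntax doesn't exist in Python 2.7, so the explicit for loop is the right fix here.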