Tags: python, scrapy, scrapyd

Scrapy request chaining not working with Spider Middleware


This is similar to the question How can i use multiple requests and pass items in between them in scrapy python.

I am trying to chain requests from my spider as in Dave McLain's answer there. Returning a request object from the parse method works fine, and the spider continues with the next request.

    def parse(self, response):
        # ... extract items from the response ...

        self.url_index += 1
        if self.url_index < len(self.urls):
            # Chain to the next URL; the returned Request passes through
            # the spider middleware before it is scheduled.
            return scrapy.Request(url=self.urls[self.url_index], callback=self.parse)
        return items

However, I have a default spider middleware in which I do some caching and logging in process_spider_output. A request object returned from parse first goes through this middleware, so the middleware has to return the request object as well.

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.

        if hasattr(spider, 'multiple_urls'):
            if spider.url_index + 1 < len(spider.urls):
                return [result]
                # return [scrapy.Request(url=spider.urls[spider.url_index], callback=spider.parse)]
        # Some operations ...

According to the documentation, it must return an iterable of Request or item objects. However, when I return the result (which contains a Request object), or construct a new request object (as in the commented-out line), the spider simply terminates (emitting the spider finished signal) without making a new request.

Documentation link: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#writing-your-own-spider-middleware

I am not sure whether the issue is with the documentation or with my interpretation of it. But returning request objects from the middleware doesn't make a new request; instead, it terminates the flow.


Solution

  • The fix was simple yet frustrating to find. The middleware is supposed to return an iterable of request objects. However, putting the request object into a list (which is an iterable) doesn't work. Using yield result in the process_spider_output middleware function instead does.

    Since the main issue is resolved, I'll leave this answer as a reference. Better explanations of why this is the case are appreciated.
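A plausible explanation (hedged, based on how Scrapy documents this hook): the `result` argument passed to `process_spider_output` is already an iterable of the spider's returned requests/items, so `return [result]` hands the engine a list whose single element is itself an iterable rather than a Request, and the engine likely discards it. Re-yielding each element (or `yield from result`) keeps the output a flat iterable of Request/item objects. The sketch below demonstrates the shape difference in plain Python; `FakeRequest` is a purely illustrative stand-in for `scrapy.Request`, not Scrapy's actual class.

```python
# Plain-Python sketch of the "nested iterable" problem in
# process_spider_output. FakeRequest is illustrative only.

class FakeRequest:
    def __init__(self, url):
        self.url = url


def broken_middleware(result):
    # Wrapping the already-iterable `result` in a list nests it:
    # the consumer sees one element that is a list, not a Request.
    return [result]


def working_middleware(result):
    # Re-yield each element so the output stays a flat iterable of
    # Request/item objects (equivalent to `yield from result`).
    for obj in result:
        # caching / logging operations would go here
        yield obj


# What Scrapy would hand in: an iterable of the spider's output.
spider_output = [FakeRequest("https://example.com/next")]

flat = list(working_middleware(spider_output))
nested = list(broken_middleware(spider_output))

assert isinstance(flat[0], FakeRequest)        # a schedulable Request
assert not isinstance(nested[0], FakeRequest)  # a list, not a Request
```

Under this reading, `yield result` works in the original middleware because the function becomes a generator whose output the engine consumes element by element, whereas `return [result]` adds an extra level of nesting around the iterable it received.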