Scrapy: return FormRequest in Downloader Middleware


A website I am scraping sometimes redirects to a page with a form, which I would like to handle in a downloader middleware. The idea is that every time this redirect occurs, the middleware automatically submits the form and retrieves the results. My middleware looks something like this:

from scrapy import FormRequest

class SubmitFormMiddleware:
    def process_response(self, request, response, spider):
        if response.css('form.loginbox').getall():
            post_form_url = response.css('form.loginbox::attr(action)').get()
            return FormRequest(url=response.urljoin(post_form_url),
                               formdata={'username': 'my_username',
                                         'password': 'my_password',
                                         'data_selection': 'all'},
                               method='POST',
                               dont_filter=True)
        else:
            return response

This doesn't work because the new request has no callback defined (and I shouldn't define one, since I am in a middleware):

NotImplementedError: DefaultSpider.parse callback is not defined

If I wanted to just return a request I would have something like:

redirected = request.replace(url=response.urljoin(post_form_url))
return self._redirect(redirected, request, spider, response.status)

but this does not work for submitting a form. Does anybody know what the 'Scrapy-thonic' way is to use the FormRequest in a Downloader Middleware?


Solution

  • I managed to solve this problem in the following way:

    from scrapy import FormRequest
    
    class SubmitFormMiddleware:
        def process_response(self, request, response, spider):
            if response.css('form.loginbox').getall():
                post_form_url = response.css('form.loginbox::attr(action)').get()
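                # Build a throwaway FormRequest only to obtain the url-encoded
                # body and the matching Content-Type header.
                form_request_handle = FormRequest(url=response.urljoin(post_form_url),
                                                  formdata={'username': 'my_username',
                                                            'password': 'my_password',
                                                            'data_selection': 'all'},
                                                  method='POST',
                                                  dont_filter=True)
                # replace() keeps the original callback, errback and meta, so the
                # spider handles the form result like the page it asked for.
                return request.replace(url=form_request_handle.url,
                                       method='POST',
                                       body=form_request_handle.body,
                                       headers=form_request_handle.headers,
                                       dont_filter=True)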
            else:
                return response
    

    Although this works, I am still curious about the 'Scrapy-thonic' way to submit a FormRequest in a middleware.