python · scrapy · crawlera

How to authenticate with a Scrapy spider when Zyte Smart Proxy Manager (formerly Crawlera) is enabled?


I followed the scrapy-zyte-smartproxy documentation to integrate proxy usage into my spider. Now my spider can't log in.


Solution

  • To make this work, we have to use Crawlera sessions and, in addition, disable Crawlera cookies. There is an old PR for this, but it is still not merged and does not work. You need to create your own Scrapy spider middleware in your_project/middlewares.py to attach the Crawlera headers to each spider request.

    from scrapy import Request


    class ZyteSmartProxySessionMiddleware(object):
        """Propagate the Smart Proxy Manager (Crawlera) session to follow-up requests."""

        def process_spider_output(self, response, result, spider):
            def _set_session(request_or_item):
                # Items pass through untouched; only requests get the session header.
                if not isinstance(request_or_item, Request):
                    return request_or_item

                request = request_or_item
                header = b'X-Crawlera-Session'
                session = response.headers.get(header)
                error = response.headers.get(b'X-Crawlera-Error')
                session_is_bad = error == b'bad_session_id'

                # Only reuse the session if the response actually carried one
                # and the proxy did not report it as bad.
                if session is not None and not session_is_bad:
                    request.headers[header] = session
                    request.headers['X-Crawlera-Cookies'] = 'disable'

                return request

            return (_set_session(request_or_item)
                    for request_or_item in result or ())
    

    Enable this middleware in your settings.py file.

    SPIDER_MIDDLEWARES = {
        'your_project.middlewares.ZyteSmartProxySessionMiddleware': 543,
    }
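
    For completeness, the proxy itself also has to stay enabled on the downloader side, as described in the scrapy-zyte-smartproxy documentation the question refers to. A minimal sketch of those settings follows; the API key is a placeholder and the exact setting names and priority should be checked against the documentation for your version.

    # These sit alongside the SPIDER_MIDDLEWARES entry above.
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610,
    }

    ZYTE_SMARTPROXY_ENABLED = True
    ZYTE_SMARTPROXY_APIKEY = '<your Zyte API key>'  # placeholder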
    

    To start the session, attach the X-Crawlera-Session: create header to the login request inside your Scrapy spider.

    from scrapy import FormRequest

    def parse(self, response):
        auth_data = {'username': self.user, 'password': self.password}
        request = FormRequest.from_response(response, formdata=auth_data,
                                            callback=self.redirect_to_select)
        # Ask Smart Proxy Manager to create a new session for the login request.
        request.headers.setdefault('X-Crawlera-Session', 'create')
        return request
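
    After the login request, the spider middleware above propagates the session, so follow-up callbacks do not have to touch the headers themselves. A hypothetical sketch of the next callback (the CSS selector and the parse_items callback are placeholders, not part of the original answer):

    def redirect_to_select(self, response):
        # Requests yielded here inherit X-Crawlera-Session (and the
        # cookie-disable header) from the spider middleware automatically.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_items)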
    

    Note that, according to the documentation, the spider will be slowed down after this, because all requests in a session go out through the same IP:

    There is a default delay of 12 seconds between each request using the same IP. These delays can differ for more popular domains.
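
    Since the proxy enforces this per-IP delay itself, it usually makes sense not to stack Scrapy's own throttling on top of it. A sketch using standard Scrapy settings (the values are illustrative; tune them for your account and target site):

    # Let the proxy pace the requests instead of Scrapy.
    AUTOTHROTTLE_ENABLED = False
    DOWNLOAD_DELAY = 0

    # Session requests can wait in the proxy for a while, so allow a generous timeout.
    DOWNLOAD_TIMEOUT = 600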