I followed the scrapy-zyte-smartproxy documentation to integrate proxy usage into my spider. Now my spider can't log in.
To make this work we have to use Crawlera sessions, and we also need to disable Crawlera cookies. There's an old PR for this, but it's still not merged and doesn't work. You need to create your own Scrapy spider middleware in your your_project/middlewares.py file that attaches the Crawlera session headers to each spider request.
from scrapy import Request


class ZyteSmartProxySessionMiddleware(object):
    def process_spider_output(self, response, result, spider):
        def _set_session(request_or_item):
            # Pass items (and anything else that isn't a Request) through.
            if not isinstance(request_or_item, Request):
                return request_or_item
            request = request_or_item
            header = b'X-Crawlera-Session'
            session = response.headers.get(header)
            error = response.headers.get(b'X-Crawlera-Error')
            session_is_bad = error == b'bad_session_id'
            # Reuse the session id from the response unless Crawlera
            # flagged it as bad; dropping the header lets the next
            # request start over with a fresh session.
            if session is not None and not session_is_bad:
                request.headers[header] = session
            # Sessions handle cookies on the proxy side, so disable
            # Crawlera's cookie handling on every outgoing request.
            request.headers['X-Crawlera-Cookies'] = 'disable'
            return request

        return (_set_session(request_or_item)
                for request_or_item in result or ())
Enable this middleware in your settings.py file (the number is the middleware's order, not a boolean):
SPIDER_MIDDLEWARES = {
    'your_project.middlewares.ZyteSmartProxySessionMiddleware': 610,
}
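For reference, this all sits on top of the standard scrapy-zyte-smartproxy setup from its README; if you followed the docs you should already have something like the following in settings.py (the API key is a placeholder):

DOWNLOADER_MIDDLEWARES = {
    'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610,
}
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = '<your Zyte API key>'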
To start the session, attach the X-Crawlera-Session: create header to the login request inside your Scrapy spider:
from scrapy import FormRequest

def parse(self, response):
    auth_data = {'username': self.user, 'password': self.password}
    request = FormRequest.from_response(response, formdata=auth_data,
                                        callback=self.redirect_to_select)
    # Ask Crawlera to create a new session for this request; the
    # response will carry the session id in X-Crawlera-Session.
    request.headers.setdefault('X-Crawlera-Session', 'create')
    return request
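For completeness, here is a minimal sketch of how the flow continues (the URL and parse_account are placeholders I made up, not from the original code): once the login response comes back with an X-Crawlera-Session header, the spider middleware above copies it, along with X-Crawlera-Cookies: disable, onto every request the callback yields, so the whole logged-in crawl stays on one session:

def redirect_to_select(self, response):
    # Hypothetical continuation: requests yielded here inherit the
    # session header via ZyteSmartProxySessionMiddleware, so they go
    # out through the same Crawlera IP as the login request.
    # (Request is the class imported in the middleware snippet above.)
    yield Request(response.urljoin('/account'),  # placeholder URL
                  callback=self.parse_account)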
Note that, according to the documentation, the spider will be slowed down after this: there is a default delay of 12 seconds between requests that use the same IP, and the delay can be higher for more popular domains.
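If the slowdown matters, one option (my suggestion, not something the scrapy-zyte-smartproxy docs prescribe) is to stop Scrapy from queueing concurrency the single session can't use, and to give throttled requests more time to complete:

# settings.py, assumed tuning for a single-session spider
CONCURRENT_REQUESTS = 1   # one session serializes requests anyway
DOWNLOAD_TIMEOUT = 600    # allow for proxy-side throttling delays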