I'm looking to learn more about how Scrapy can be used to log in to websites. I looked at some documentation and tutorials and ended up at Using FormRequest.from_response() to simulate a user login. Using Chrome dev tools, I looked at the "login" response after logging in from the page https://eventbrite.ca/signin/login.
Something that may be important to note: when attempting to log in via the browser, the site first directs you to https://eventbrite.ca/signin, where you enter your email and submit the form.
This sends a POST request to https://www.eventbrite.ca/api/v3/users/lookup/ with just the email provided, and if all is dandy, the page uses JS to "redirect" you to https://eventbrite.ca/signin/login and generate the "password" input element.
Once you fill in your password and hit the submit button, a POST is sent to https://www.eventbrite.ca/ajax/login/ with the email, password, and some other info (which can be found in my code snippet), and if successful, you're redirected and the login response is generated.
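So the flow observed in dev tools boils down to roughly this sequence (sketched here with the requests library purely to illustrate the traffic; the field names are the ones I saw, and this is not a working login):

    import requests

    s = requests.Session()
    # Loading the sign-in page sets the initial cookies (including csrftoken)
    s.get('https://www.eventbrite.ca/signin/')

    # Step 1: email lookup; on success the page "redirects" to /signin/login
    s.post('https://www.eventbrite.ca/api/v3/users/lookup/',
           data={'email': '[email protected]'})

    # Step 2: the actual login POST
    s.post('https://www.eventbrite.ca/ajax/login/',
           data={'email': '[email protected]', 'password': 'password',
                 'forward': '', 'referrer': '/', 'pckg': '', 'stld': ''})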
First I tried doing it step by step: going from .ca/signin and sending a POST with my email to the lookup endpoint, but I get a 401 error. Next I tried going directly to .ca/signin/login and submitting all the info found in the login response, but I receive a 403.
I'm sure I must be missing something; it seems I'm POSTing to the correct URLs and finding the correct form, but I can't figure out what's left. After trying this for a while, I'm also wondering whether Selenium would be a better alternative for logging in and automating a page with loads of JS. Any help appreciated.
My login callback:
def login(self, response):
    yield FormRequest.from_response(
        response,
        formxpath="//form[(@novalidate)]",
        url='https://www.eventbrite.ca/ajax/login/',
        formdata={
            'email': '[email protected]',
            'password': 'password',
            'forward': '',
            'referrer': '/',
            'pckg': '',
            'stld': ''
        },
        callback=self.begin_event_parse
    )
.ca/signin/login attempt (403):
[scrapy.core.engine] DEBUG: Crawled (403) <POST https://www.eventbrite.ca/ajax/login/> (referer: https://www.eventbrite.ca/signin/login)
.ca/signin attempt (401):
[scrapy.core.engine] DEBUG: Crawled (401) <POST https://www.eventbrite.ca/api/v3/users/lookup/> (referer: https://www.eventbrite.ca/signin/login)
It looks like you are missing the X-CSRFToken header. This token is used to protect the resource from cross-site request forgery (CSRF). In this case it is set in a cookie, and you need to store it and pass it along with your requests.
A simple implementation that works for me:
import re
import scrapy


class DarazspidySpider(scrapy.Spider):
    name = 'darazspidy'

    def start_requests(self):
        # Load the sign-in page first so the server sets the csrftoken cookie.
        yield scrapy.Request(
            'https://www.eventbrite.ca/signin/?referrer=%2F%3Finternal_ref%3Dlogin%26internal_ref%3Dlogin%26internal_ref%3Dlogin',
            callback=self.lookup,
        )

    def lookup(self, response):
        # Step 1: the email lookup call, with the CSRF token in the headers.
        yield scrapy.FormRequest(
            'https://www.eventbrite.ca/api/v3/users/lookup/',
            formdata={'email': '[email protected]'},
            headers={'X-CSRFToken': self._get_xcsrf_token(response)},
            callback=self.login,
        )

    def _get_xcsrf_token(self, response):
        # Extract the csrftoken value from the Set-Cookie response headers.
        cookies = response.headers.getlist('Set-Cookie')
        cookie, = [c for c in cookies if 'csrftoken' in str(c)]
        self.token = re.search(r'csrftoken=(\w+)', str(cookie)).group(1)
        return self.token

    def login(self, response):
        # Step 2: the actual login call, reusing the stored token.
        yield scrapy.FormRequest(
            url='https://www.eventbrite.ca/ajax/login/',
            formdata={
                'email': '[email protected]',
                'password': 'pwd',
                'forward': '',
                'referrer': '/?internal_ref=login&internal_ref=login',
                'pckg': '',
                'stld': '',
            },
            callback=self.parse,
            headers={'X-CSRFToken': self.token},
        )

    def parse(self, response):
        self.logger.info('Logged in!')
Ideally, you'd want to create a middleware to do that for you.
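A minimal sketch of what such a middleware could look like (untested, and the names are my own; it captures the csrftoken cookie from responses and injects the X-CSRFToken header into subsequent requests):

    import re

    class CsrfTokenDownloaderMiddleware:
        """Remembers the csrftoken cookie and injects it as an
        X-CSRFToken header on outgoing requests."""

        def __init__(self):
            self.token = None

        def process_response(self, request, response, spider):
            # Capture the latest csrftoken from any Set-Cookie header we see.
            for cookie in response.headers.getlist('Set-Cookie'):
                match = re.search(r'csrftoken=(\w+)', cookie.decode('utf-8', 'ignore'))
                if match:
                    self.token = match.group(1)
            return response

        def process_request(self, request, spider):
            # Attach the token to every request once we have one.
            if self.token and not request.headers.get('X-CSRFToken'):
                request.headers['X-CSRFToken'] = self.token
            return None

You would then enable it via DOWNLOADER_MIDDLEWARES in your settings, pointing at wherever you put the class.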
Generally, when you face this kind of behavior, you want to mimic what the browser sends as closely as possible, so look at the headers carefully and try to replicate them.
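For example, something like this in the spider, with the values copied from the browser's request in dev tools (the ones below are placeholders):

    def start_requests(self):
        yield scrapy.Request(
            'https://www.eventbrite.ca/signin/',
            headers={
                'User-Agent': 'Mozilla/5.0 ...',  # use your browser's real UA string
                'Referer': 'https://www.eventbrite.ca/',
                'X-Requested-With': 'XMLHttpRequest',
            },
            callback=self.lookup,
        )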