So I'm writing a web crawler to batch download PDFs from my university's website, as I don't fancy downloading them one by one.
I've got most of the code working, using the 'requests' module. The catch is that you have to be signed in to a university account to access the PDFs, so I've set up requests to use cookies to sign in to my university account before downloading them. However, the HTML sign-in form on the university page is rather peculiar.
I've abstracted the HTML, which is shown here:
<form action="/login" method="post">
<fieldset>
<div>
<label for="username">Username:</label>
<input id="username" name="username" type="text" value="" />
<label for="password">Password:</label>
<input id="password" name="password" type="password" value=""/>
<input type="hidden" name="lt" value="" />
<input type="hidden" name="execution" value="*very_long_encrypted_code*" />
<input type="hidden" name="_eventId" value="submit" />
<input type="submit" name="submit" value="Login" />
</div>
</fieldset>
</form>
Firstly, the action attribute in the form does not reference a PHP file, which I don't understand. Is action="/login" referencing the page itself, or http://www.blahblah/login/login? (The HTML is taken from the page http://www.blahblah/login.)
Secondly, what's with all the 'hidden' inputs? I'm not sure how this page is taking the given login data and passing it to a PHP script.
This has led to the failure of the requests sign-on in my Python script:
import requests
user = input("User: ")
passw = input("Password: ")
payload = {"username" : user, "password" : passw}
s = requests.Session()
s.post(loginURL, data = payload)
r = s.get(url)
I would have thought this would take the login data and sign me in, but r is just assigned the original logon page. I'm assuming it's to do with the strange PHP interaction in the HTML. Any ideas what I need to change?
EDIT: Thought I'd also mention there is no JavaScript on the page at all. Purely HTML & CSS.
What you are looking at is likely a CSRF token.
The linked answer is very good, but in summary, these tokens are used to make sure that malicious requests can't be sent to a site from another page in your web browser. In this case it is a bit silly, because logging in has no consequences. It was likely added automatically by the framework your university website uses.
You will have to extract this token from the login page before doing your login POST and then include it with your data.
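One way to pull the token out, sketched with only the standard library: the class and function names below are made up for illustration, and the sample markup is the abstracted form from the question. It assumes the login page contains a single form, since it collects every hidden input on the page rather than scoping to one form.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> are also routed here.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # A missing or empty value is kept as an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of all hidden input fields found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


# Sample markup based on the question; the token value is a placeholder.
sample = """
<form action="/login" method="post">
<input id="username" name="username" type="text" value="" />
<input type="hidden" name="lt" value="" />
<input type="hidden" name="execution" value="*very_long_encrypted_code*" />
<input type="hidden" name="_eventId" value="submit" />
</form>
"""
print(extract_hidden_fields(sample))
```

In the real script the HTML would come from a GET of the login page made with the same Session that later sends the POST, so that the cookies and the token belong together.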
The full steps would be the following:
Send the login request:
payload = {"username" : user, "password" : passw, "execution": token}
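A slightly fuller sketch of building that request, under a couple of assumptions: the server may also expect the other hidden inputs (lt and _eventId) to be sent back, so this version forwards all of them rather than only execution, and the helper name build_login_request is invented for illustration. It also shows how the form's action is resolved: a leading slash means "relative to the site root", so action="/login" on http://www.blahblah/login posts back to that same URL, not to /login/login.

```python
from urllib.parse import urljoin


def build_login_request(page_url, action, hidden_fields, username, password):
    """Return (post_url, payload) for submitting the login form."""
    # A form action is resolved relative to the URL of the page containing it.
    post_url = urljoin(page_url, action)
    # Start from the hidden fields so the token is echoed back unchanged.
    payload = dict(hidden_fields)
    payload["username"] = username
    payload["password"] = password
    return post_url, payload


# Placeholder values standing in for what the login page actually served.
hidden = {"lt": "", "execution": "token-from-page", "_eventId": "submit"}
post_url, payload = build_login_request(
    "http://www.blahblah/login", "/login", hidden, "alice", "secret"
)
print(post_url)
print(payload)

# With requests, reusing one Session for every step keeps the cookies:
#   s = requests.Session()
#   page = s.get("http://www.blahblah/login")   # fetch the form and token
#   ... extract the hidden fields from page.text ...
#   s.post(post_url, data=payload)              # log in
#   r = s.get(url)                              # then fetch the PDFs
```

If r still contains the login form after this, comparing the payload with what a browser sends (visible in the network tab of the developer tools) is usually the quickest way to spot a missing field.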