
Logging into a website and retrieving HTML with Python


For a project I'm doing, I need to log in to a website so I can access the HTML of a login-protected page.

I'm using this person's answer with the values I need:

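from twill.commands import *

# Open the login page in twill's browser
go('https://example.com/login')

# Fill in the "email" and "password" fields of form 3
fv("3", "email", "[email protected]")
fv("3", "password", "mypassword")

# Submit the login form
submit()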

Presumably this should log me in, so I then run:

import urllib

# Python 2: fetch the protected page with urllib, separately from twill
sock = urllib.urlopen("https://www.example.com/activities")
html_source = sock.read()
sock.close()
print html_source

I expected this to print the HTML of the (now) accessible page, but it just gives me the HTML of the login page. I've tried other methods (e.g. with mechanize) but I get the same result.

What am I missing? Do some sites restrict this type of login, or does it not work over HTTPS? (The actual site is FitBit; I couldn't use its URL in the question.)


Solution

  • You're using one library to log in and another to retrieve the subsequent page. twill and urllib do not share session data, so the urllib request arrives without the login cookie and the site serves the login page again. (Similar issue to this one.) If you mix libraries like this, you need to manage the session cookie / authentication yourself: copy the cookie and any related data from the library that logged in, and add them to the post-login request made with the other library.

    Otherwise, and more logically, use the same library for both the login and post-login requests.
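
    As a rough illustration of keeping both requests in one session, here is a minimal sketch. It uses Python 3's standard library (urllib.request with http.cookiejar) rather than twill, and the helper names make_session, login and fetch are made up for this example. It assumes the site accepts a plain form POST with the "email" and "password" fields from the question; many real sites also expect hidden form fields such as CSRF tokens, which this sketch does not handle.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Build an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, fields):
    """POST the form fields; cookies set by the response are kept in the jar."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with opener.open(login_url, data) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch(opener, url):
    """GET a page through the same opener, so the session cookie is sent too."""
    with opener.open(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

    Usage would look something like this (the credentials are placeholders): call opener, jar = make_session(), then login(opener, "https://example.com/login", {"email": "user@example.com", "password": "mypassword"}), then fetch(opener, "https://www.example.com/activities"). The point is that the same opener, and therefore the same cookie jar, handles both the login and the later page request. A fresh urllib call with no cookie jar would be treated as logged out, which is what happened in the question.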