
Login to web page using twisted web


I want to write a simple web client using twisted, that logs into a web site with a username and password, and grabs some data from a given page. What is the best way to do this?

Edit: To add some more details: This is a simple username/password HTML form submission. There is a PHPSESSID cookie involved, as this is a PHP site. No captchas. No HTTP authentication.


Solution

  • HTML form authentication is done by submitting the authentication form. This means knowing both the form action and method. For starters, you could manually read the page source and find out this information. A more general solution would involve parsing the page (with something like lxml or html5lib, probably) and extracting this information automatically.

    You also need to know the names of the username and password fields in the form, as well as the names and correct values for any other mandatory form fields.

    For example, a form that looks like this:

    <form action="https://example.com/auth" method="post">
        <input type="text" name="Email" id="Email" value="">
        <input type="password" name="Password" id="Password" value="">
    </form>
    

    has a form action of https://example.com/auth and a method of post. So you need to issue a POST request to https://example.com/auth.
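    The automatic approach mentioned above can be sketched without lxml or html5lib using the standard library's HTML parser (shown here with Python 3's html.parser module; both of those third-party libraries would be more robust against real-world markup):

```python
from html.parser import HTMLParser


class FormParser(HTMLParser):
    """Collect the action, method, and input field names of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
            self.method = attrs.get("method", "get").lower()
        elif tag == "input" and "name" in attrs:
            self.fields.append(attrs["name"])


page = '''
<form action="https://example.com/auth" method="post">
    <input type="text" name="Email" id="Email" value="">
    <input type="password" name="Password" id="Password" value="">
</form>
'''

parser = FormParser()
parser.feed(page)
print(parser.action)  # https://example.com/auth
print(parser.method)  # post
print(parser.fields)  # ['Email', 'Password']
```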

    By convention, the encoding for the data in such a request uses application/x-www-form-urlencoded as its content-type.

    You can encode the body for such a request using the Python stdlib urllib.urlencode.
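    For example (urllib.urlencode is the Python 2 spelling used in this answer; on Python 3 the same function lives in urllib.parse — the field names here just match the hypothetical form above):

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# Encode the form fields as an application/x-www-form-urlencoded body.
body = urlencode({"Email": "user@example.com", "Password": "secret"})
print(body)  # Email=user%40example.com&Password=secret
```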

    Finally, if authentication success is represented as a cookie that must be re-presented with future requests, you need to make sure you capture the value of the cookie and re-send it.
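    For instance, the PHPSESSID cookie mentioned in the question arrives in a Set-Cookie response header. Parsing one with the standard library's cookie module looks roughly like this (the header value here is made up; the module is http.cookies on Python 3, Cookie on Python 2):

```python
from http.cookies import SimpleCookie  # Cookie.SimpleCookie on Python 2

# A hypothetical Set-Cookie header value from the login response.
header = "PHPSESSID=abc123; path=/"
jar = SimpleCookie()
jar.load(header)
session = jar["PHPSESSID"].value
print(session)  # abc123

# Re-present it on later requests as a Cookie header:
cookie_header = "; ".join("%s=%s" % (k, m.value) for k, m in jar.items())
print(cookie_header)  # PHPSESSID=abc123
```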

    So, putting this all together:

    import urllib

    from twisted.web.client import getPage

    cookies = {}
    d = getPage(
        "https://example.com/auth",
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        postdata=urllib.urlencode(dict(Email="user@example.com", Password="secret")),
        cookies=cookies)
    

    The cookie dictionary will be filled in with the value of any cookies set by the server. Pass it along with any future getPage calls that should use the result of this authentication.

    All that said, I like the recommendation to use scrapy. It will do a lot of this low-level stuff for you and let you focus on the more interesting part of your problem.