Tags: python, post, mechanize, twill

Python requests module: POST and go to next page


I'm filling out a form on a web page using Python's requests module. I submit the form as a POST request, which works fine, and I get the expected response. However, it's a multistep form: after the first "submit" the site loads another form on the same page (using AJAX), and the POST response contains this new HTML page. Now, how do I use this response to fill the form on the new page? Can I combine the requests module with Twill or Mechanize in some way?

Here's the code for the POST:

import requests
from requests.auth import HTTPProxyAuth

# Imported in the hope of reusing their form-filling APIs later on
import formfill
from twill import get_browser
from twill.commands import *
import mechanize
from mechanize import ParseResponse, urlopen, urljoin

# Proxy configuration (placeholders)
http_proxy = "some_Proxy"
https_proxy = "some_Proxy"

proxyDict = {
    "http": http_proxy,
    "https": https_proxy
}

# Authenticate against the proxy and submit the first form
auth = HTTPProxyAuth("user", "pass")
r = requests.post("site_url", data={'key': 'value'}, proxies=proxyDict, auth=auth)

The response r above contains the new HTML page that resulted from submitting the first form. That page also has a form which I have to fill. Can I hand r off to Twill or Mechanize in some way and use Mechanize's form-filling API? Any ideas would be helpful.


Solution

  • The problem here is that you need to actually interact with the JavaScript on the page. requests, while an excellent library, has no support for JavaScript interaction; it is just an HTTP library.

    If you want to interact with JavaScript-rich web pages in a meaningful way, I would suggest Selenium. Selenium drives a real web browser, so it can navigate a page exactly as a person would.
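    A minimal sketch of that approach, assuming Selenium with a Chrome driver is installed; the field names and URL below are placeholders guessed for illustration, not taken from the question:

        from selenium import webdriver
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.support import expected_conditions as EC

        driver = webdriver.Chrome()          # or webdriver.Firefox(), etc.
        driver.get("site_url")

        # Fill and submit the first form (names are placeholders)
        driver.find_element(By.NAME, "key").send_keys("value")
        driver.find_element(By.NAME, "submit").click()

        # Wait for the AJAX-loaded second form to appear, then fill it too
        second_field = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.NAME, "second_key"))
        )
        second_field.send_keys("second_value")
        driver.find_element(By.NAME, "submit2").click()

        driver.quit()

        # To "go headless" (see the options below), Chrome can be started without a UI:
        #   options = webdriver.ChromeOptions()
        #   options.add_argument("--headless")
        #   driver = webdriver.Chrome(options=options)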

    The main issue is that you'll see your speed drop precipitously: rendering a web page takes a lot longer than a raw HTML request. If that's a real deal breaker for you, you've got two options:

    • Go headless: There are many options here, but I personally prefer casper. You should see a ~3x speed-up in browsing times by going headless, but every site is different.
    • Find a way to do everything through HTTP: Most non-visual site functions have an equivalent HTTP request. Using the Network tab in Google's developer tools you can dig into the requests that are actually being fired, then replicate those in Python (see the sketch after this list).
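    A rough sketch of that replication approach, assuming you have already copied the AJAX endpoint and payload out of the Network tab; the URL, headers, and field names below are placeholders:

        import requests

        session = requests.Session()

        # The endpoint and payload come straight from the Network tab,
        # not from the page's visible form (placeholders shown here).
        ajax_url = "site_url/ajax/step2"
        payload = {"key": "value", "step": "2"}

        headers = {
            # Many AJAX endpoints check for this header
            "X-Requested-With": "XMLHttpRequest",
        }

        resp = session.post(ajax_url, data=payload, headers=headers)
        print(resp.status_code, resp.text[:200])

    Using a Session keeps cookies between the first and second request, which is usually what a multistep form expects.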

    As far as the tools you mentioned go, neither mechanize nor twill will help. Your main issue here is JavaScript interaction rather than cookie management, and since neither of those frameworks supports JavaScript, you would run into the same problem.

    UPDATE: If the POST response is actually the new page, then you're not really dealing with AJAX at all. In that case, since you already have the raw HTML, simply mimic the HTTP request that the second form would send. The same approach you used on the first form will work on the second: either pull the field names and values out of the HTML response, or simply hard-code the successive requests.
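    A minimal sketch of that update, using BeautifulSoup (not mentioned in the question, just one convenient way to parse the HTML) to scrape the second form out of r.text and re-post it with the same session; the field names are placeholders, and the proxy/auth arguments from the question are omitted for brevity:

        import requests
        from urllib.parse import urljoin
        from bs4 import BeautifulSoup  # assumption: pip install beautifulsoup4

        session = requests.Session()
        r = session.post("site_url", data={"key": "value"})

        # Parse the second form out of the response HTML
        soup = BeautifulSoup(r.text, "html.parser")
        form = soup.find("form")  # adjust if the page has several forms

        # Carry over any hidden/prefilled inputs, then add your own values
        fields = {
            inp.get("name"): inp.get("value", "")
            for inp in form.find_all("input")
            if inp.get("name")
        }
        fields["second_key"] = "second_value"   # placeholder field

        # The form action may be relative, so join it against the page URL
        action = urljoin(r.url, form.get("action", ""))

        r2 = session.post(action, data=fields)
        print(r2.status_code)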