asp.net python mechanize scraperwiki

Screen scraping aspx with Python Mechanize - Javascript form submission


I'm trying to scrape UK Food Ratings Agency data from aspx search results pages (e.g., http://ratings.food.gov.uk/QuickSearch.aspx?q=po30 ) using Mechanize/Python on scraperwiki ( http://scraperwiki.com/scrapers/food_standards_agency/ ), but I'm running into a problem when trying to follow the "next" page links, which have the form:

<input type="submit" name="ctl00$ContentPlaceHolder1$uxResults$uxNext" value="Next >" id="ctl00_ContentPlaceHolder1_uxResults_uxNext" title="Next >" />

The form handler looks like:

<form method="post" action="QuickSearch.aspx?q=po30" onsubmit="javascript:return WebForm_OnSubmit();" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_ContentPlaceHolder1_buttonSearch')" id="aspnetForm">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />

An HTTP trace when I manually click the Next link shows __EVENTTARGET as empty. All the cribs I can find in other scrapers show the manipulation of __EVENTTARGET as the way of handling Next pages.

Indeed, I'm not sure how the page I want to scrape ever loads the next page? Whatever I throw at the scraper, it only ever manages to load the first results page. (Even being able to change the number of results per page would be useful, but I can't see how to do that either!)

So - any ideas on how to scrape the (N+1)th results page for N > 0?


Solution

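  • Mechanize doesn't handle JavaScript, but in this particular case it isn't needed. The Next control is a plain submit input rather than a JavaScript postback link, so clicking it posts the button's own name/value pair and leaves __EVENTTARGET empty, which matches the HTTP trace in the question.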

    First we open the result page with mechanize

    import mechanize

    url = 'http://ratings.food.gov.uk/QuickSearch.aspx?q=po30'
    br = mechanize.Browser()
    br.set_handle_robots(False)  # do not fetch or obey robots.txt
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    br.open(url)
    response = br.response().read()  # HTML of the first results page
    

    Then we select the aspnet form:

    br.select_form(nr=0) #Select the first (and only) form - it has no name so we reference by number
    

    The form has several submit buttons - we want to submit the one that takes us to the next result page:

    response = br.submit(name='ctl00$ContentPlaceHolder1$uxResults$uxNext').read()  #"Press" the next submit button
    

    The other submit buttons in the form are:

    ctl00$uxLanguageSwitch # Switch language to Welsh
    ctl00$ContentPlaceHolder1$uxResults$Button1 # Search submit button
    ctl00$ContentPlaceHolder1$uxResults$uxFirst # First result page
    ctl00$ContentPlaceHolder1$uxResults$uxPrevious # Previous result page
    ctl00$ContentPlaceHolder1$uxResults$uxLast # Last result page
    

    In mechanize we can get form info like this:

    for form in br.forms():
        print(form)  # lists every control in the form, including the submit buttons
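
To walk through all the result pages rather than just the second one, the "select the form, press Next" step can be repeated in a loop. The following is only a minimal sketch: iter_result_pages and max_pages are hypothetical names introduced here, and the stopping rule assumes that the last results page no longer contains the Next button. That has not been checked against the site; if the button is merely disabled on the last page, the max_pages cap is the only thing that ends the loop.

```python
def iter_result_pages(br, next_name, max_pages=50):
    """Yield the raw HTML of each results page in turn.

    br is expected to behave like a mechanize.Browser that has already
    opened the first results page (hypothetical helper, not part of mechanize).
    """
    marker = next_name.encode("ascii")  # read() returns bytes on Python 3
    html = br.response().read()
    for page_number in range(1, max_pages + 1):
        yield html
        # Assumption: the last page no longer contains the Next button.
        if page_number == max_pages or marker not in html:
            break
        br.select_form(nr=0)                     # the single aspnet form
        html = br.submit(name=next_name).read()  # "press" Next

# Usage sketch with a real browser (not run here):
#   br = mechanize.Browser()
#   br.open('http://ratings.food.gov.uk/QuickSearch.aspx?q=po30')
#   for html in iter_result_pages(br, 'ctl00$ContentPlaceHolder1$uxResults$uxNext'):
#       ...parse html...
```

Each submit goes through mechanize's own form handling, so the hidden ASP.NET fields in the form are sent back with every request without being set by hand.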