I need to scrape information on all former US governors from this site. To read out the results and then follow the links, I either have to page through the different results pages or, preferably, set the results-per-page limit to its maximum of 100 (I don't think any state has more than 100 results). However, the paging control seems to use JavaScript: it is not part of a form, and it seems I cannot access it as a control.
Any ideas on how to proceed? I am fairly new to Python and only use it for tasks like this from time to time. Here is some simple code that iterates through the main form.
import mechanize
import lxml.html
import csv

site = "http://www.nga.org/cms/FormerGovBios"
output = csv.writer(open(r'output.csv', 'wb'))
br = mechanize.Browser()
response = br.open(site)
br.select_form(name="governorsSearchForm")
states = br.find_control(id="states-field", type="select").items
for pos, item in enumerate(states[1:2]):
    statename = str([label.text for label in item.get_labels()])
    print pos, item.name, statename, len(states)
    br.select_form(name="governorsSearchForm")
    br["state"] = [item.name]
    response = br.submit(name="submit", type="submit")
    # now set page limit to 100, get links and descriptions
    # and follow each link to get information
    for form in br.forms():
        print "Form name:", form.name
        print form, "\n"
    for link in br.links():
        print link.text, link.url
OK, this is a screwball approach. Playing around with the different search settings, I found that the number of results to display is part of the URL. So I changed it to 3000 per page, and everything fits on one page.
After the page loads (which does take a while), right-click and choose "View Page Source", then copy that into a text file on your computer. You can then scrape the info you need from the file without going back to the server and having to process the JavaScript.
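If you'd rather build that URL programmatically than edit it by hand, something like the sketch below works (Python 3 here; the parameter name `pagesize` is a placeholder assumption — check the actual query string in your browser's address bar after running a search to find the real name):

```python
from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qs

def set_page_size(url, size):
    """Return url with its results-per-page query parameter set to size.

    NOTE: 'pagesize' is a guessed parameter name -- replace it with
    whatever the real search URL uses.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query['pagesize'] = [str(size)]          # overwrite (or add) the paging parameter
    new_query = urlencode(query, doseq=True)  # doseq flattens the list values
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       new_query, parts.fragment))

print(set_page_size("http://www.nga.org/cms/FormerGovBios?state=NY&pagesize=10", 3000))
```

Then fetch that one big page once instead of paging through results.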
May I recommend BeautifulSoup for getting around in the HTML file.
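A minimal sketch of what that could look like on the saved source (Python 3 with bs4; the class name `governor-link` and the markup here are made-up placeholders — inspect the real page source for the actual tags and classes):

```python
from bs4 import BeautifulSoup

# Stand-in for the saved page source; the real file would be read with open().
html = """
<html><body>
  <a class="governor-link" href="/bio/1">Gov. One</a>
  <a class="governor-link" href="/bio/2">Gov. Two</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Pull out each result link's text and href for later crawling.
for a in soup.find_all("a", class_="governor-link"):
    print(a.get_text(), a["href"])
```

The same `find_all` pattern works for whatever element actually wraps each governor entry on the real page.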