Tags: python, search, python-2.7, web, screen-scraping

How to start a query from a static website?


The problem

I need to search for some information about companies using the following link.

What I need to do is a search by entity name, with the search-type drop-down set to "begin with". I would also like "All items" selected in the "Display number of items to view" drop-down. For example, if I enter "google" in the "Enter name" text box, the script should return a list of companies whose names start with "google" (though this is just the starting point of what I want to do).

Question: How should I use Python to do this? I found the following thread: Using Python to ask a web page to run a search

I tried the example from the first answer; the code is below:

from bs4 import BeautifulSoup as BS
import requests

protein='Q9D880'

text = requests.get('http://www.uniprot.org/uniprot/' + protein).text
soup = BS(text)
MGI = soup.find(name='a', onclick="UniProt.analytics('DR-lines', 'click', 'DR-MGI');").text
MGI = MGI[4:]
print protein +' - ' + MGI

The above code works because the UniProt page contains that analytics onclick attribute, which the code uses to locate the element. However, the website I am working with has nothing like that.

I also tried to do the same thing as the first answer in this thread: how to submit query to .aspx page in python

However, the example code provided in the first answer does not work on my machine (Ubuntu 12.04 with Python 2.7). I am also not clear about which values should go where, since I am dealing with a different .aspx website.

How can I use Python to start a search with certain criteria (I'm not sure this is the proper web terminology; maybe "submit a form"?)?

I come from a C++ background and have not done any web work. I am also learning Python. Any help is greatly appreciated.

First EDIT:
With great help from @Kabie, I collected the following code (trying to understand how it works):

import requests
from lxml import etree

URL = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSearch.aspx'

#With get_fields(), we fetched all <input>s from the form.
def get_fields():
    res = requests.get(URL)
    if res.ok:
        page = etree.HTML(res.text)
        fields = page.xpath('//form[@id="Form1"]//input')
        return { e.attrib['name']: e.attrib.get('value', '') for e in fields }

#hard code some selects from the Form
def query(data):
    formdata = get_fields()
    formdata.update({
        'ctl00$MainContent$ddRecordsPerPage':'25',
    }) # Hardcode some <select> value
    formdata.update(data)
    res = requests.post(URL, formdata)
    if res.ok:
        page = etree.HTML(res.text)
        return page.xpath('//table[@id="MainContent_SearchControl_grdSearchResultsEntity"]//tr')


def search_by_entity_name(entity_name, entity_search_type='B'):
    return query({
        'ctl00$MainContent$CorpSearch':'rdoByEntityName',
        'ctl00$MainContent$txtEntityName': entity_name,
        'ctl00$MainContent$ddBeginsWithEntityName': entity_search_type,
    })

result = search_by_entity_name('google')

The above code is put in a script named query.py. I got the following error:

Traceback (most recent call last):
  File "query.py", line 39, in <module>
    result = search_by_entity_name('google')
  File "query.py", line 36, in search_by_entity_name
    'ctl00$MainContent$ddBeginsWithEntityName': entity_search_type,
  File "query.py", line 21, in query
    formdata.update({
AttributeError: 'NoneType' object has no attribute 'update'

It seems to me that the search is not succeeding. Why?
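Reading the traceback: query() received None from get_fields(). When res.ok is false (the request failed), get_fields() reaches the end of the function without hitting a return statement, so it implicitly returns None, and None has no .update() method. A minimal stand-alone illustration of that failure mode (no network needed; get_fields_stub is just a stand-in for get_fields()):

```python
def get_fields_stub(ok):
    # Mirrors the shape of get_fields(): there is no explicit return
    # when `ok` is false, so the function implicitly returns None.
    if ok:
        return {'__VIEWSTATE': ''}
    # falls through here -> returns None, just like get_fields()

print(get_fields_stub(True))   # the normal dict of form fields
print(get_fields_stub(False))  # None -- calling .update() on this raises
```

So the real question is why the GET request is failing in the first place.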


Solution

  • You can inspect the page to find out all the fields that need to be posted. There is a nice tutorial for Chrome DevTools. Other tools such as Firebug on Firefox or Dragonfly on Opera also do the job, though I recommend DevTools.

    After you post a query, you can see in the Network panel the form data that was actually sent. In this case:

    __EVENTTARGET:
    __EVENTARGUMENT:
    __LASTFOCUS:
    __VIEWSTATE:5UILUho/L3O0HOt9WrIfldHD4Ym6KBWkQYI1GgarbgHeAdzM9zyNbcH0PdP6xtKurlJKneju0/aAJxqKYjiIzo/7h7UhLrfsGul1Wq4T0+BroiT+Y4QVML66jsyaUNaM6KNOAK2CSzaphvSojEe1BV9JVGPYWIhvx0ddgfi7FXKIwdh682cgo4GHmilS7TWcbKxMoQvm9FgKY0NFp7HsggGvG/acqfGUJuw0KaYeWZy0pWKEy+Dntb4Y0TGwLqoJxFNQyOqvKVxnV1MJ0OZ4Nuxo5JHmkeknh4dpjJEwui01zK1WDuBHHsyOmE98t2YMQXXTcE7pnbbZaer2LSFNzCtrjzBmZT8xzCkKHYXI31BxPBEhALcSrbJ/QXeqA7Xrqn9UyCuTcN0Czy0ZRPd2wabNR3DgE+cCYF4KMGUjMUIP+No2nqCvsIAKmg8w6Il8OAEGJMAKA01MTMONKK4BH/OAzLMgH75AdGat2pvp1zHVG6wyA4SqumIH//TqJWFh5+MwNyZxN2zZQ5dBfs3b0hVhq0cL3tvumTfb4lr/xpL3rOvaRiatU+sQqgLUn0/RzeKNefjS3pCwUo8CTbTKaSW1IpWPgP/qmCsuIovXz82EkczLiwhEZsBp3SVdQMqtAVcYJzrcHs0x4jcTAWYZUejvtMXxolAnGLdl/0NJeMgz4WB9tTMeETMJAjKHp2YNhHtFS9/C1o+Hxyex32QxIRKHSBlJ37aisZLxYmxs69squmUlcsHheyI5YMfm0SnS0FwES5JqWGm2f5Bh+1G9fFWmGf2QeA6cX/hdiRTZ7VnuFGrdrJVdbteWwaYQuPdekms2YVapwuoNzkS/A+un14rix4bBULMdzij25BkXpDhm3atovNHzETdvz5FsXjKnPlno0gH7la/tkM8iOdQwqbeh7sG+/wKPqPmUk0Cl0kCHNvMCZhrcgQgpIOOgvI2Fp+PoB7mPdb80T2sTJLlV7Oe2ZqMWsYxphsHMXVlXXeju3kWfpY+Ed/D8VGWniE/eoBhhqyOC2+gaWA2tcOyiDPDCoovazwKGWz5B+FN1OTep5VgoHDqoAm2wk1C3o0zJ9a9IuYoATWI1yd2ffQvx6uvZQXcMvTIbhbVJL+ki4yNRLfVjVnPrpUMjafsnjIw2KLYnR0rio8DWIJhpSm13iDj/KSfAjfk4TMSA6HjhhEBXIDN/ShQAHyrKeFVsXhtH5TXSecY6dxU+Xwk7iNn2dhTILa6S/Gmm06bB4nx5Zw8XhYIEI/eucPOAN3HagCp7KaSdzZvrnjbshmP8hJPhnFhlXdJ+OSYDWuThFUypthTxb5NXH3yQk1+50SN872TtQsKwzhJvSIJExMbpucnVmd+V2c680TD4gIcqWVHLIP3+arrePtg0YQiVTa1TNzNXemDyZzTUBecPynkRnIs0dFLSrz8c6HbIGCrLleWyoB7xicUg39pW7KTsIqWh7P0yOiHgGeHqrN95cRAYcQTOhA==
    __SCROLLPOSITIONX:0
    __SCROLLPOSITIONY:106
    __VIEWSTATEENCRYPTED:
    __EVENTVALIDATION:g2V3UVCVCwSFKN2X8P+O2SsBNGyKX00cyeXvPVmP5dZSjIwZephKx8278dZoeJsa1CkMIloC0D51U0i4Ai0xD6TrYCpKluZSRSphPZQtAq17ivJrqP1QDoxPfOhFvrMiMQZZKOea7Gi/pLDHx42wy20UdyzLHJOAmV02MZ2fzami616O0NpOY8GQz1S5IhEKizo+NZPb87FgC5XSZdXCiqqoChoflvt1nfhtXFGmbOQgIP8ud9lQ94w3w2qwKJ3bqN5nRXVf5S53G7Lt+Du78nefwJfKK92BSgtJSCMJ/m39ykr7EuMDjauo2KHIp2N5IVzGPdSsiOZH86EBzmYbEw==
    ctl00$MainContent$hdnApplyMasterPageWitoutSidebar:0
    ctl00$MainContent$hdn1:0
    ctl00$MainContent$CorpSearch:rdoByEntityName
    ctl00$MainContent$txtEntityName:GO
    ctl00$MainContent$ddBeginsWithEntityName:M
    ctl00$MainContent$ddBeginsWithIndividual:B
    ctl00$MainContent$txtFirstName:
    ctl00$MainContent$txtMiddleName:
    ctl00$MainContent$txtLastName:
    ctl00$MainContent$txtIdentificationNumber:
    ctl00$MainContent$txtFilingNumber:
    ctl00$MainContent$ddRecordsPerPage:25
    ctl00$MainContent$btnSearch:Search Corporations
    ctl00$MainContent$hdnW:1920
    ctl00$MainContent$hdnH:1053
    ctl00$MainContent$SearchControl$hdnRecordsPerPage:
    

    What I posted was a "Begin with" search for 'GO'. This site is built with WebForms, hence the long __VIEWSTATE and __EVENTVALIDATION fields. We need to send them as well.

    Now we are ready to make the query. First we need to get a blank form. The following code is written in Python 3.3, though I think it should still work on 2.x.

    import requests
    from lxml import etree
    
    URL = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSearch.aspx'
    
    def get_fields():
        res = requests.get(URL)
        if res.ok:
            page = etree.HTML(res.text)
            fields = page.xpath('//form[@id="Form1"]//input')
            return { e.attrib['name']: e.attrib.get('value', '') for e in fields }
    

    With get_fields(), we fetch all <input>s from the form. Note that there are also <select>s; I will just hardcode their values.

    def query(data):
        formdata = get_fields()
        formdata.update({
            'ctl00$MainContent$ddRecordsPerPage':'25',
        }) # Hardcode some <select> value
        formdata.update(data)
        res = requests.post(URL, formdata)
        if res.ok:
            page = etree.HTML(res.text)
            return page.xpath('//table[@id="MainContent_SearchControl_grdSearchResultsEntity"]//tr')
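If you'd rather not hardcode the <select> values, you could also scrape each select's pre-selected option from the blank form. A sketch, run here against a stand-in HTML snippet (the real page's markup may differ, and the "All Items" value of '0' is only a guess):

```python
from lxml import etree

# Hypothetical snippet standing in for the real CorpSearch page.
html = '''<form id="Form1">
  <select name="ctl00$MainContent$ddRecordsPerPage">
    <option value="25" selected="selected">25</option>
    <option value="0">All Items</option>
  </select>
</form>'''

page = etree.HTML(html)
selects = {}
for sel in page.xpath('//form[@id="Form1"]//select'):
    # Take the pre-selected <option>, falling back to the first one.
    opts = sel.xpath('.//option[@selected]') or sel.xpath('.//option')
    if opts:
        selects[sel.attrib['name']] = opts[0].attrib.get('value', '')

print(selects)  # {'ctl00$MainContent$ddRecordsPerPage': '25'}
```

The resulting dict can be merged into formdata the same way the hardcoded values are.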
    

    Now that we have a generic query function, let's make wrappers for specific searches.

    def search_by_entity_name(entity_name, entity_search_type='B'):
        return query({
            'ctl00$MainContent$CorpSearch':'rdoByEntityName',
            'ctl00$MainContent$txtEntityName': entity_name,
            'ctl00$MainContent$ddBeginsWithEntityName': entity_search_type,
        })
    

    This particular site uses a group of radio buttons to decide which set of fields is used, so 'ctl00$MainContent$CorpSearch':'rdoByEntityName' is necessary here. You can write others like search_by_individual_name yourself.
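For instance, a wrapper for the individual-name search could build its fields from the names seen in the DevTools dump above. Note that the radio value 'rdoByIndividualName' is my guess (the dump only shows the entity-name value); verify it in the Network panel before relying on it. A sketch:

```python
def individual_search_data(first_name, last_name, middle_name='',
                           search_type='B'):
    # Builds the extra form fields for an individual-name search; pass
    # the result to query(). The txt*/dd* field names come from the
    # posted form data above; 'rdoByIndividualName' is an unverified guess.
    return {
        'ctl00$MainContent$CorpSearch': 'rdoByIndividualName',
        'ctl00$MainContent$txtFirstName': first_name,
        'ctl00$MainContent$txtMiddleName': middle_name,
        'ctl00$MainContent$txtLastName': last_name,
        'ctl00$MainContent$ddBeginsWithIndividual': search_type,
    }

print(individual_search_data('John', 'Smith'))
```

A wrapper would then be `def search_by_individual_name(*args): return query(individual_search_data(*args))`.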

    Sometimes a website needs more information before it will accept a query. In that case you can add custom headers such as Origin, Referer, and User-Agent to mimic a browser.

    And if the website uses JavaScript to generate its forms, you need more than requests. PhantomJS is a good tool for scripting a browser; if you want to do this in Python, you can use PyQt with QtWebKit.

    Update: It seems the website started blocking our Python script yesterday, so we have to pose as a browser. As mentioned above, we can add custom headers. Let's first add a User-Agent field to the headers and see what happens.

    res = requests.get(URL, headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
    })
    

    And now... res.ok returns True!

    So we just need to add this header to both calls: res = requests.get(URL) in get_fields() and res = requests.post(URL, formdata) in query(). Just in case, also add 'Referer': URL to the headers of the latter:

    res = requests.post(URL, formdata, headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
        'Referer':URL,
    })