The problem
I have the following question: I need to search for information about a company using the following link. What I need to do is a search by entity name, with the search type set to the "begins with" drop-down value. I would also like "All items" selected in the "Display number of items to view" drop-down. For example, if I enter "google" in the "Enter name" text box, the script should return a list of companies whose names start with "google" (though this is just the starting point of what I want to do).
Question: How should I use Python to do this? I found the following thread: Using Python to ask a web page to run a search
I tried the example in the first answer; the code is below:
from bs4 import BeautifulSoup as BS
import requests
protein='Q9D880'
text = requests.get('http://www.uniprot.org/uniprot/' + protein).text
soup = BS(text)
MGI = soup.find(name='a', onclick="UniProt.analytics('DR-lines', 'click', 'DR-MGI');").text
MGI = MGI[4:]
print protein +' - ' + MGI
The above code works because the UniProt page contains that analytics onclick attribute to match on. However, the website I am using doesn't have anything like it.
I also tried to do the same thing as the first answer in this thread: how to submit query to .aspx page in python
However, the example code provided in the first answer does not work on my machine (Ubuntu 12.04 with Python 2.7). I am also not clear about which values should go there, since I am dealing with a different .aspx website.
How could I use Python to start a search with certain criteria (not sure this is the proper web terminology; maybe "submit a form"?)?
I come from a C++ background and have not done any web programming. I am also learning Python. Any help is greatly appreciated.
First EDIT:
With great help from @Kabie, I put together the following code (trying to understand how it works):
import requests
from lxml import etree

URL = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSearch.aspx'

# With get_fields(), we fetch all <input>s from the form.
def get_fields():
    res = requests.get(URL)
    if res.ok:
        page = etree.HTML(res.text)
        fields = page.xpath('//form[@id="Form1"]//input')
        return { e.attrib['name']: e.attrib.get('value', '') for e in fields }

# Hardcode some <select> values from the form.
def query(data):
    formdata = get_fields()
    formdata.update({
        'ctl00$MainContent$ddRecordsPerPage':'25',
    }) # Hardcode some <select> value
    formdata.update(data)
    res = requests.post(URL, formdata)
    if res.ok:
        page = etree.HTML(res.text)
        return page.xpath('//table[@id="MainContent_SearchControl_grdSearchResultsEntity"]//tr')

def search_by_entity_name(entity_name, entity_search_type='B'):
    return query({
        'ctl00$MainContent$CorpSearch':'rdoByEntityName',
        'ctl00$MainContent$txtEntityName': entity_name,
        'ctl00$MainContent$ddBeginsWithEntityName': entity_search_type,
    })

result = search_by_entity_name('google')
The above code is in a script named query.py. I got the following error:
Traceback (most recent call last):
  File "query.py", line 39, in <module>
    result = search_by_entity_name('google')
  File "query.py", line 36, in search_by_entity_name
    'ctl00$MainContent$ddBeginsWithEntityName': entity_search_type,
  File "query.py", line 21, in query
    formdata.update({
AttributeError: 'NoneType' object has no attribute 'update'
It seems to me that the search was not successful. Why?
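The AttributeError means formdata was None: get_fields() only returns the dict when res.ok is true, and otherwise falls off the end of the function, implicitly returning None. A sketch of the same function, split so the parsing is separate and the rejected GET fails loudly (the error message is made up for illustration):

```python
import requests
from lxml import etree

URL = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSearch.aspx'

def parse_fields(html_text):
    """Collect all <input> name/value pairs inside the Form1 form."""
    page = etree.HTML(html_text)
    fields = page.xpath('//form[@id="Form1"]//input')
    return {e.attrib['name']: e.attrib.get('value', '') for e in fields}

def get_fields():
    res = requests.get(URL)
    if not res.ok:
        # Raise instead of silently returning None, so the real failure
        # (here, the server rejecting the request) is visible at once.
        raise RuntimeError('GET %s failed: HTTP %d' % (URL, res.status_code))
    return parse_fields(res.text)
```

With this version the traceback would point at the failing GET rather than at a later .update() on None.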
You can inspect the page to find out all the fields that need to be posted. There is a nice tutorial for Chrome DevTools. Other tools like Firebug on Firefox or Dragonfly on Opera also do the job, though I recommend DevTools.
After you post a query, you can see in the Network panel the form data that was actually sent. In this case:
__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:5UILUho/L3O0HOt9WrIfldHD4Ym6KBWkQYI1GgarbgHeAdzM9zyNbcH0PdP6xtKurlJKneju0/aAJxqKYjiIzo/7h7UhLrfsGul1Wq4T0+BroiT+Y4QVML66jsyaUNaM6KNOAK2CSzaphvSojEe1BV9JVGPYWIhvx0ddgfi7FXKIwdh682cgo4GHmilS7TWcbKxMoQvm9FgKY0NFp7HsggGvG/acqfGUJuw0KaYeWZy0pWKEy+Dntb4Y0TGwLqoJxFNQyOqvKVxnV1MJ0OZ4Nuxo5JHmkeknh4dpjJEwui01zK1WDuBHHsyOmE98t2YMQXXTcE7pnbbZaer2LSFNzCtrjzBmZT8xzCkKHYXI31BxPBEhALcSrbJ/QXeqA7Xrqn9UyCuTcN0Czy0ZRPd2wabNR3DgE+cCYF4KMGUjMUIP+No2nqCvsIAKmg8w6Il8OAEGJMAKA01MTMONKK4BH/OAzLMgH75AdGat2pvp1zHVG6wyA4SqumIH//TqJWFh5+MwNyZxN2zZQ5dBfs3b0hVhq0cL3tvumTfb4lr/xpL3rOvaRiatU+sQqgLUn0/RzeKNefjS3pCwUo8CTbTKaSW1IpWPgP/qmCsuIovXz82EkczLiwhEZsBp3SVdQMqtAVcYJzrcHs0x4jcTAWYZUejvtMXxolAnGLdl/0NJeMgz4WB9tTMeETMJAjKHp2YNhHtFS9/C1o+Hxyex32QxIRKHSBlJ37aisZLxYmxs69squmUlcsHheyI5YMfm0SnS0FwES5JqWGm2f5Bh+1G9fFWmGf2QeA6cX/hdiRTZ7VnuFGrdrJVdbteWwaYQuPdekms2YVapwuoNzkS/A+un14rix4bBULMdzij25BkXpDhm3atovNHzETdvz5FsXjKnPlno0gH7la/tkM8iOdQwqbeh7sG+/wKPqPmUk0Cl0kCHNvMCZhrcgQgpIOOgvI2Fp+PoB7mPdb80T2sTJLlV7Oe2ZqMWsYxphsHMXVlXXeju3kWfpY+Ed/D8VGWniE/eoBhhqyOC2+gaWA2tcOyiDPDCoovazwKGWz5B+FN1OTep5VgoHDqoAm2wk1C3o0zJ9a9IuYoATWI1yd2ffQvx6uvZQXcMvTIbhbVJL+ki4yNRLfVjVnPrpUMjafsnjIw2KLYnR0rio8DWIJhpSm13iDj/KSfAjfk4TMSA6HjhhEBXIDN/ShQAHyrKeFVsXhtH5TXSecY6dxU+Xwk7iNn2dhTILa6S/Gmm06bB4nx5Zw8XhYIEI/eucPOAN3HagCp7KaSdzZvrnjbshmP8hJPhnFhlXdJ+OSYDWuThFUypthTxb5NXH3yQk1+50SN872TtQsKwzhJvSIJExMbpucnVmd+V2c680TD4gIcqWVHLIP3+arrePtg0YQiVTa1TNzNXemDyZzTUBecPynkRnIs0dFLSrz8c6HbIGCrLleWyoB7xicUg39pW7KTsIqWh7P0yOiHgGeHqrN95cRAYcQTOhA==
__SCROLLPOSITIONX:0
__SCROLLPOSITIONY:106
__VIEWSTATEENCRYPTED:
__EVENTVALIDATION:g2V3UVCVCwSFKN2X8P+O2SsBNGyKX00cyeXvPVmP5dZSjIwZephKx8278dZoeJsa1CkMIloC0D51U0i4Ai0xD6TrYCpKluZSRSphPZQtAq17ivJrqP1QDoxPfOhFvrMiMQZZKOea7Gi/pLDHx42wy20UdyzLHJOAmV02MZ2fzami616O0NpOY8GQz1S5IhEKizo+NZPb87FgC5XSZdXCiqqoChoflvt1nfhtXFGmbOQgIP8ud9lQ94w3w2qwKJ3bqN5nRXVf5S53G7Lt+Du78nefwJfKK92BSgtJSCMJ/m39ykr7EuMDjauo2KHIp2N5IVzGPdSsiOZH86EBzmYbEw==
ctl00$MainContent$hdnApplyMasterPageWitoutSidebar:0
ctl00$MainContent$hdn1:0
ctl00$MainContent$CorpSearch:rdoByEntityName
ctl00$MainContent$txtEntityName:GO
ctl00$MainContent$ddBeginsWithEntityName:M
ctl00$MainContent$ddBeginsWithIndividual:B
ctl00$MainContent$txtFirstName:
ctl00$MainContent$txtMiddleName:
ctl00$MainContent$txtLastName:
ctl00$MainContent$txtIdentificationNumber:
ctl00$MainContent$txtFilingNumber:
ctl00$MainContent$ddRecordsPerPage:25
ctl00$MainContent$btnSearch:Search Corporations
ctl00$MainContent$hdnW:1920
ctl00$MainContent$hdnH:1053
ctl00$MainContent$SearchControl$hdnRecordsPerPage:
What I posted is "Begins with 'GO'". This site is built with WebForms, so there are these long __VIEWSTATE and __EVENTVALIDATION fields. We need to send them as well.
Now we are ready to make the query. First we need to get a blank form. The following code is written in Python 3.3, though I think it should still work on 2.x.
import requests
from lxml import etree

URL = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSearch.aspx'

def get_fields():
    res = requests.get(URL)
    if res.ok:
        page = etree.HTML(res.text)
        fields = page.xpath('//form[@id="Form1"]//input')
        return { e.attrib['name']: e.attrib.get('value', '') for e in fields }
With get_fields(), we fetch all the <input>s from the form. Note there are also <select>s; I will just hardcode their values.
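If you would rather not hardcode the <select> values, you can scrape the available <option>s from the same page. A small sketch (the field name ctl00$MainContent$ddRecordsPerPage comes from the form dump above; the HTML fragment is a made-up miniature of the real form):

```python
from lxml import etree

def get_select_options(html_text, select_name):
    """Return the option values of a named <select>, in document order."""
    page = etree.HTML(html_text)
    options = page.xpath('//select[@name="%s"]//option' % select_name)
    return [o.attrib.get('value', '') for o in options]

# Tiny fragment shaped like the CorpSearch form, for illustration:
sample = ('<form id="Form1">'
          '<select name="ctl00$MainContent$ddRecordsPerPage">'
          '<option value="25">25</option><option value="50">50</option>'
          '</select></form>')
print(get_select_options(sample, 'ctl00$MainContent$ddRecordsPerPage'))  # ['25', '50']
```

Picking one of the returned values instead of a literal '25' keeps the script working if the site changes its options.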
def query(data):
    formdata = get_fields()
    formdata.update({
        'ctl00$MainContent$ddRecordsPerPage':'25',
    }) # Hardcode some <select> value
    formdata.update(data)
    res = requests.post(URL, formdata)
    if res.ok:
        page = etree.HTML(res.text)
        return page.xpath('//table[@id="MainContent_SearchControl_grdSearchResultsEntity"]//tr')
Now that we have a generic query() function, let's make a wrapper for specific searches.
def search_by_entity_name(entity_name, entity_search_type='B'):
    return query({
        'ctl00$MainContent$CorpSearch':'rdoByEntityName',
        'ctl00$MainContent$txtEntityName': entity_name,
        'ctl00$MainContent$ddBeginsWithEntityName': entity_search_type,
    })
This particular site uses a group of radio buttons to determine which fields are used, so 'ctl00$MainContent$CorpSearch':'rdoByEntityName' is necessary here. You can write others like search_by_individual_name etc. yourself.
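Such a wrapper could look like the sketch below. The text-field names (txtFirstName, txtLastName, ddBeginsWithIndividual) are taken from the form dump above, but 'rdoByIndividualName' is only a guess at the radio value for the individual-name search; verify it in DevTools before relying on it:

```python
def individual_name_fields(first_name, last_name, search_type='B'):
    # Field names come from the posted form data shown earlier;
    # 'rdoByIndividualName' is a hypothetical radio value -- check DevTools.
    return {
        'ctl00$MainContent$CorpSearch': 'rdoByIndividualName',
        'ctl00$MainContent$txtFirstName': first_name,
        'ctl00$MainContent$txtLastName': last_name,
        'ctl00$MainContent$ddBeginsWithIndividual': search_type,
    }

def search_by_individual_name(first_name, last_name, search_type='B'):
    # query() is the generic function defined above.
    return query(individual_name_fields(first_name, last_name, search_type))
```

Keeping the field dict in its own function makes it easy to inspect what would be posted without hitting the network.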
Sometimes a website needs more information to accept the query. In that case you can add custom headers like Origin, Referer, and User-Agent to mimic a browser.
And if the website uses JavaScript to generate the form, you need more than requests. PhantomJS is a good tool for scripting a browser; if you want to do this in Python, you can use PyQt with qtwebkit.
Update:
It seems the website started blocking our Python script after yesterday, so we have to pretend to be a browser. As mentioned above, we can add custom headers. Let's first add a User-Agent field to the headers and see what happens.
res = requests.get(URL, headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
})
And now... res.ok returns True!
So we just need to add this header to both the res = requests.get(URL) call in get_fields() and the res = requests.post(URL, formdata) call in query(). Just in case, also add 'Referer':URL to the headers of the latter:
res = requests.post(URL, formdata, headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
    'Referer':URL,
})
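Rather than repeating the headers in both calls, you could keep them on a requests.Session, which sends them with every request and also replays any cookies the site sets (a sketch, reusing the same User-Agent string):

```python
import requests

URL = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSearch.aspx'

session = requests.Session()
# Headers set here are merged into every session.get()/session.post().
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
    'Referer': URL,
})

# get_fields() and query() would then call session.get(URL) and
# session.post(URL, formdata) instead of the module-level requests functions.
```

This way there is a single place to change if the site starts demanding yet another header.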