python asp.net web-scraping beautifulsoup html-post

Scraping Data from .ASPX Website URL with Python


I am trying to scrape a static .aspx URL. Every attempt returns the raw HTML of the plain search page instead of the results for my query.

My understanding is that the headers I am using (taken from another post) are correct and generalizable:

import urllib.request
from bs4 import BeautifulSoup

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.request.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'https://www.mytaxcollector.com/trSearch.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup_dummy = BeautifulSoup(f,"html5lib")
# parse and retrieve two vital form values
viewstate = soup_dummy.select("#__VIEWSTATE")[0]['value']
viewstategen = soup_dummy.select("#__VIEWSTATEGENERATOR")[0]['value']

Submitting the form data seems to have no effect:

formData = (
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategen),
    ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'),
    ('__EVENTTARGET', 'ct100$MainContent$calculate')
)

encodedFields =  urllib.parse.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)


soup = BeautifulSoup(f,"html5lib")
trans_emissions = soup.find("span", id="ctl00_MainContent_transEmissions")
print(trans_emissions.text)

This gives raw HTML that is almost exactly the same as the "soup_dummy" variable. What I want to see is the result of submitting the field ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'), which is the "parcel number" box.

I would really appreciate the help. If nothing else, a link to a good post about HTTP requests (one that not only explains but actually walks through scraping .aspx pages) would be great.


Solution

  • To search by parcel number, you need a different set of form parameters from the ones you tried. You also have to send the POST request to a different URL: https://www.mytaxcollector.com/trSearchProcess.aspx

    Working code:

    from urllib.request import Request, urlopen
    from urllib.parse import urlencode
    from bs4 import BeautifulSoup
    
    url = 'https://www.mytaxcollector.com/trSearchProcess.aspx'
    
    payload = {
        'hidRedirect': '',
        'hidGotoEstimate': '',
        'txtStreetNumber': '',
        'txtStreetName': '',
        'cboStreetTag': '(Any Street Tag)',
        'cboCommunity': '(Any City)',
        'txtParcelNumber': '0108301010000',  #your search term
        'txtPropertyID': '',
        'ctl00$contentHolder$cmdSearch': 'Search'
    }
    
    # urlopen() needs the POST body as bytes, not str
    data = urlencode(payload).encode('ascii')
    # passing data makes this a POST request
    req = Request(url, data)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
    res = urlopen(req)
    soup = BeautifulSoup(res.read(), 'html.parser')
    # print the cell text of every row in the results table
    for row in soup.select("table.propInfoTable tr"):
        cells = [cell.get_text(strip=True) for cell in row.select("td")]
        print(cells)
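
One way to find field names and a POST target like the ones above is to read them from the page's own `<form>` element instead of guessing. The following is a minimal sketch of that idea using only the standard library. `FormFieldCollector`, `build_post`, and `SAMPLE_HTML` are hypothetical names made up for illustration, and the sample markup is invented, not copied from the real site, so the actual form may contain different fields. The sketch only collects `<input>` elements; dropdowns such as `cboStreetTag` are `<select>` elements and would need extra handling.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFieldCollector(HTMLParser):
    """Collect the action URL and named <input> fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Inputs without a value attribute are submitted as empty strings
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


# Invented markup for illustration only (an assumption, not the real page)
SAMPLE_HTML = """
<form method="post" action="trSearchProcess.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="text" name="txtParcelNumber" />
  <input type="submit" name="ctl00$contentHolder$cmdSearch" value="Search" />
</form>
"""


def build_post(page_url, html, overrides):
    """Return (post_url, encoded_body) for the first form found in html."""
    parser = FormFieldCollector()
    parser.feed(html)
    # Start from the form's default values, then fill in the search terms
    payload = {**parser.fields, **overrides}
    # A relative action is resolved against the URL of the page holding the form
    post_url = urljoin(page_url, parser.action or "")
    return post_url, urlencode(payload).encode("ascii")


post_url, body = build_post(
    "https://www.mytaxcollector.com/trSearch.aspx",
    SAMPLE_HTML,
    {"txtParcelNumber": "0108301010000"},
)
print(post_url)
print(body.decode("ascii"))
```

The resulting `post_url` and `body` can be passed to `Request(post_url, body)` in the same way as in the working code. The browser's network tab shows the same information for a real submission and is probably the quicker check when a form is built or modified by JavaScript.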