
Some nudges for a first-time scraper


I'm trying to programmatically (in Python) retrieve account information from this website for a list of properties I have (each identified by a BRT number).

This should be very simple, and I've read a few things I found via Google, but it's all way over my head: I have no web development experience, so the vernacular goes in one ear and out the other.

The procedure should be very simple, as the web page seems very no-frills:

  1. Set brt, e.g. 883309000.

  2. Open the url: http://www.phila.gov/revenue/RealEstateTax/default.aspx.

  3. Select the "by BRT Number" field and enter brt.

  4. Click the >> button to retrieve property info.

  5. Scrape the bottom line (TOTALS) and the accurate-to date, in this case:

    TOTALS $13,359.83 $2,539.14 $1,417.73 $1,645.59 $18,962.29

and

06/30/2015

I'm principally stuck on steps 3 and 4. I've gotten as far as:

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36')]
br.open('http://www.phila.gov/revenue/RealEstateTax/default.aspx')

# Name the parser explicitly so bs4 does not have to guess (and warn about) one
soup = BeautifulSoup(br.response().read(), "html.parser")

#Here's the BRT Number field
soup.find("input",{"id":"ctl00_BodyContentPlaceHolder_SearchByBRTControl_txtTaxInfo"})

#Here's the "Lookup by BRT" button
soup.find("input",{"id":"ctl00_BodyContentPlaceHolder_SearchByBRTControl_btnTaxByBRT"})

But I am really lost on what to do from there. Any help would be appreciated.


Solution

  • Have you considered using the selenium package for Python? The documentation for it is here; I strongly suggest you read it through, run a few basic tests to check your understanding, and skim it again before starting.

    The point of Selenium is to load the page as you would in your browser and perform commands on it, which you can automate from Python code.

    First import selenium:

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    

    Then start the webdriver and load the page. The assert checks that the page title contains "Revenue Department" before proceeding:

    driver = webdriver.Firefox()
    driver.get("http://www.phila.gov/revenue/RealEstateTax/default.aspx")
    assert "Revenue Department" in driver.title
    

    Next, select the BRT input box and send it the value of brt as keystrokes:

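    # find_element_by_id was removed in Selenium 4.3; find_element("id", ...) is the equivalent call
    driver.find_element("id", "ctl00_BodyContentPlaceHolder_SearchByBRTControl_txtTaxInfo").send_keys(str(brt))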
    

    Finally, click the >> button:

    driver.find_element("id", "ctl00_BodyContentPlaceHolder_SearchByBRTControl_btnTaxByBRT").click()
    

    Now you should be taken to the results page.
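
    For step 5 of the question (the TOTALS line and the accurate-to date), here is a minimal sketch of one way to pull both out of the results page. It assumes the visible page text contains a line beginning with TOTALS in the form quoted in the question, and that dates appear as MM/DD/YYYY; I have not checked the actual layout of the results page, so treat the patterns as a starting point. The helper names (extract_totals, find_dates, scrape_results) are made up for illustration.

```python
import re
from datetime import date, datetime
from decimal import Decimal

# Dollar amounts such as $13,359.83 and dates such as 06/30/2015.
MONEY = re.compile(r"\$([\d,]+\.\d{2})")
DATE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")


def extract_totals(page_text):
    """Return the dollar amounts on the first line that starts with TOTALS."""
    for line in page_text.splitlines():
        if line.strip().upper().startswith("TOTALS"):
            # Drop the thousands separators before converting.
            return [Decimal(m.replace(",", "")) for m in MONEY.findall(line)]
    return []


def find_dates(page_text):
    """Return every MM/DD/YYYY date found in the text, in page order."""
    return [datetime.strptime(d, "%m/%d/%Y").date() for d in DATE.findall(page_text)]


def scrape_results(driver):
    """Read the visible text of the current page and extract both values.

    Assumption: `driver` is a Selenium webdriver already showing the results
    page. "tag name" is the locator string behind By.TAG_NAME.
    """
    text = driver.find_element("tag name", "body").text
    return extract_totals(text), find_dates(text)
```

    After the click, something like totals, dates = scrape_results(driver) would give you a list of Decimal amounts and a list of date objects. Which of the dates is the accurate-to date depends on where it sits on the page, so check that by eye for one property before looping over your whole list of BRT numbers.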