Tags: python, dom, screen-scraping, mechanize

Raw HTML vs. DOM scraping in Python using mechanize and Beautiful Soup


I am attempting to write a program that, as an example, will scrape the top price off of this web page:

http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults

First, I am easily able to retrieve the HTML by doing the following:

from BeautifulSoup import BeautifulSoup
import mechanize

webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'

# Fetch the page with mechanize and read the response body
br = mechanize.Browser()
data = br.open(webpage).get_data()

# Parse the returned HTML and print it
soup = BeautifulSoup(data)
print soup

However, the raw HTML does not contain the price. The browser does...its thing (clarification here might help me also)...and retrieves the price from elsewhere while it constructs the DOM tree.

I was led to believe that mechanize would act just like my browser and return the DOM tree, which I also understand to be what I see in, for example, Chrome's Developer Tools view of the page. (If I'm incorrect about this, how do I get at wherever that price information is stored?) Is there something I need to tell mechanize to do in order to see the DOM tree?

Once I can get the DOM tree into Python, everything else I need to do should be a snap. Thanks!
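
One likely contributor to the missing price, offered as a sketch and not as a diagnosis of this particular site: everything after the `#` in the URL is a fragment, and HTTP clients do not send the fragment to the server. mechanize also does not execute JavaScript, so it only ever receives the generic page, while a browser runs the page's scripts, which can read the fragment and load the fares afterwards. The snippet below (Python 3, standard library only) just shows how the URL splits.

```python
from urllib.parse import urlsplit

url = "http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults"
parts = urlsplit(url)

# Only the path (plus any query string) is part of the HTTP request.
request_target = parts.path or "/"

# The fragment stays on the client; only in-page JavaScript can read it.
fragment = parts.fragment

print(request_target)
print(fragment)
```

Here the request target is just the site root, and the whole flight search lives in the fragment.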


Solution

  • Answering my own question because, in the years since asking this, I have learned a lot. Today I would use Selenium WebDriver for this job. Selenium is exactly the tool I was looking for back in 2012 for this type of web scraping project: it drives a real browser, so the page's JavaScript runs and the resulting DOM is available to the script.

    https://www.seleniumhq.org/download/

    http://chromedriver.chromium.org/
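
A minimal sketch of the Selenium approach, assuming Python 3, Selenium 4, and a recent Chrome with a matching driver available. The `.price` selector, the helper names `fetch_rendered_html` and `first_price`, and the dollar-amount regex are all hypothetical placeholders; the real selector has to be read off the page in Developer Tools.

```python
import re

# Assumed price format: a dollar sign followed by digits, e.g. "$1,234" or "$987.50".
PRICE_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{2})?)")


def fetch_rendered_html(url, wait_css, timeout=30):
    """Load url in headless Chrome and return the DOM serialized after scripts ran."""
    # Imported here so first_price() below can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumes a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until an element matching wait_css exists, i.e. the scripts
        # have inserted the content we care about (or the timeout expires).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


def first_price(html):
    """Return the first dollar amount found in html as a float, or None."""
    match = PRICE_RE.search(html)
    return float(match.group(1).replace(",", "")) if match else None


if __name__ == "__main__":
    url = "http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults"
    html = fetch_rendered_html(url, wait_css=".price")  # ".price" is a placeholder
    print(first_price(html))
```

The string returned by `driver.page_source` can also be handed to Beautiful Soup in place of the raw HTML from mechanize, so the parsing code from the question carries over unchanged.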