Tags: python, web-scraping, beautifulsoup, urlopen

How can I scrape a page with Python's urlopen after it has finished loading all search results?


I am trying to scrape air ticket info (including plane info, price info, etc.) from http://flight.qunar.com/ using Python 3 and BeautifulSoup. Below is the Python code I am using. In this code I try to scrape flight info from Beijing (北京) to Lijiang (丽江) on 2012-07-25.

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
url = 'http://flight.qunar.com/site/oneway_list.htm'
values = {'searchDepartureAirport':'北京', 'searchArrivalAirport':'丽江', 'searchDepartureTime':'2012-07-25'}
encoded_param = urllib.parse.urlencode(values)
full_url = url + '?' + encoded_param
response = urllib.request.urlopen(full_url)
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(response, 'html.parser')
print(soup.prettify())

What I get is the initial page returned right after submitting the request, while the page is still loading the search results. What I want is the final page, after it has finished loading the search results. How can I achieve this using Python?


Solution

  • The problem is actually quite hard: the site uses dynamically generated content that is loaded via JavaScript, whereas urllib fetches essentially only what you would see in a browser with JavaScript disabled. So, what can we do?

    Use a headless, automated browser (a tool built for testing and scraping) to fully render the web page.

    Or, if you want a (semi-)pure Python solution, use PyQt4.QtWebKit to render the page. It works approximately like this:

    import sys
    import signal
    
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage
    
    url = "http://www.stackoverflow.com"
    
    def page_to_file(ok):
        # loadFinished passes a success flag (bool), not the page itself,
        # so this uses the module-level `page` object created below.
        with open("output", 'w') as f:
            f.write(page.mainFrame().toHtml())
    
    # QApplication requires the argument list
    app = QApplication(sys.argv)
    page = QWebPage()
    # Restore the default SIGINT handler so Ctrl+C stops the Qt event loop
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    page.loadFinished.connect(page_to_file)
    page.mainFrame().load(QUrl(url))
    sys.exit(app.exec_())
    

    Edit: There's a nice explanation of how this works here.

    Ps: You may want to look into requests instead of using urllib :)
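
    To illustrate why urllib returns only the initial page, here is a minimal sketch using just the standard library. The HTML string is a made-up stand-in (not the real qunar.com markup) for a page whose result list is filled in by a script, and `ListItemCounter` and `count_list_items` are hypothetical helper names. A plain HTML parser, like a plain HTTP client, never runs the script, so it sees an empty list:

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the initial HTML a server might send: the result
# list is empty, and a script would fill it in later inside a real browser.
INITIAL_HTML = """
<html><body>
  <ul id="results"></ul>
  <script>
    var li = document.createElement("li");
    li.textContent = "flight 1";
    document.getElementById("results").appendChild(li);
  </script>
</body></html>
"""


class ListItemCounter(HTMLParser):
    """Counts <li> start tags that are present in the markup itself."""

    def __init__(self):
        super().__init__()
        self.li_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_count += 1


def count_list_items(html):
    """Return the number of <li> elements found in the static markup."""
    parser = ListItemCounter()
    parser.feed(html)
    parser.close()
    return parser.li_count


# The script is never executed, so no result rows exist in the static HTML.
print(count_list_items(INITIAL_HTML))
```

    Anything the script would add only exists after a JavaScript engine has run, which is why a rendering approach such as the QtWebKit one above is needed to get the final page.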