Search code examples
pythonscreen-scrapinghtmlunitpycurl

Screen scraping with Python


Does Python have screen scraping libraries that offer JavaScript support?

I've been using pycurl for simple HTML requests, and Java's HtmlUnit for more complicated requests requiring JavaScript support.

Ideally I would like to be able to do everything from Python, but I haven't come across any libraries that would allow me to do it. Do they exist?


Solution

  • There are many options when dealing with static HTML, which the other responses cover. However if you need JavaScript support and want to stay in Python I recommend using webkit to render the webpage (including the JavaScript) and then examine the resulting HTML. For example:

    import sys
    import signal
    from PyQt4.QtCore import *
    from PyQt4.QtGui import *
    from PyQt4.QtWebKit import QWebPage
    
    class Render(QWebPage):
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.html = None
            signal.signal(signal.SIGINT, signal.SIG_DFL)
            self.connect(self, SIGNAL('loadFinished(bool)'), self._finished_loading)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()
    
        def _finished_loading(self, result):
            self.html = self.mainFrame().toHtml()
            self.app.quit()
    
    
    if __name__ == '__main__':
        try:
            url = sys.argv[1]
        except IndexError:
            print 'Usage: %s url' % sys.argv[0]
        else:
            javascript_html = Render(url).html