
Load a web page


I am trying to load a web page using PySide's QtWebKit module. According to the documentation (Elements of QWebView; QWebFrame::toHtml()), the following script should print the HTML of the Google search page:

from PySide import QtCore
from PySide import QtGui
from PySide import QtWebKit

# Needed if we want to display the webpage in a widget.
app = QtGui.QApplication([])

view = QtWebKit.QWebView(None)
view.setUrl(QtCore.QUrl("http://www.google.com/"))
frame = view.page().mainFrame()
print(frame.toHtml())

But alas, it does not. All that is printed is an empty document, the method's equivalent of a null response:

<html><head></head><body></body></html>

So I took a closer look at the setUrl documentation:

The view remains the same until enough data has arrived to display the new url.

This made me think that maybe I was calling the toHtml() method too soon, before a response had been received from the server. So I wrote a class that overrides the setUrl method, blocking until the loadFinished signal is emitted:

import time

class View(QtWebKit.QWebView):
    def __init__(self, *args, **kwargs):
        super(View, self).__init__(*args, **kwargs)
        self.completed = True
        self.loadFinished.connect(self.setCompleted)

    def setCompleted(self):
        self.completed = True

    def setUrl(self, url):
        self.completed = False
        super(View, self).setUrl(url)
        while not self.completed:
            time.sleep(0.2)

view = View(None)
view.setUrl(QtCore.QUrl("http://www.google.com/"))
frame = view.page().mainFrame()
print(frame.toHtml())

That made no difference at all. What am I missing here?

EDIT: Merely getting the HTML of a page is not my end goal here. This is a simplified example of code that was not working the way I expected it to. Credit to Oleh for suggesting replacing time.sleep() with app.processEvents().


Solution

  • Copied from my other answer:

    from PySide.QtCore import QObject, QUrl, Slot
    from PySide.QtGui import QApplication
    from PySide.QtWebKit import QWebPage, QWebSettings
    
    qapp = QApplication([])
    
    def load_source(url):
        page = QWebPage()
        page.settings().setAttribute(QWebSettings.AutoLoadImages, False)
        page.mainFrame().setUrl(QUrl(url))
    
        class State(QObject):
            src = None
            finished = False
    
            @Slot()
            def loaded(self, success=True):
                self.finished = True
                if self.src is None:
                    self.src = page.mainFrame().toHtml()
        state = State()
    
        # Optional; reacts to DOM ready, which happens before a full load
        def js():
            page.mainFrame().addToJavaScriptWindowObject('qstate$', state)
            page.mainFrame().evaluateJavaScript('''
                document.addEventListener('DOMContentLoaded', qstate$.loaded);
            ''')
        page.mainFrame().javaScriptWindowObjectCleared.connect(js)
    
        page.mainFrame().loadFinished.connect(state.loaded)
    
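        # Pump Qt's event loop by hand until one of the callbacks above fires;
        # without this, the page never gets a chance to load.
        while not state.finished:
            qapp.processEvents()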
    
        return state.src
    

    load_source downloads the data from a URL and returns the HTML as modified by WebKit. It wraps Qt's event loop, with its asynchronous events, in a blocking function.

    But you really should think about what you're doing. Do you actually need to invoke the engine and get the modified HTML? If you just want to download the HTML of some webpage, there are much, much simpler ways to do this.

    Now, the problem with the code in your question is that you don't let Qt do anything. There is no magic happening, no code running in the background. Qt is based on an event loop, and you never let it enter that loop. This is usually achieved by calling QApplication.exec_(), or, as a workaround, by calling processEvents() repeatedly as shown in my code. You can replace time.sleep(0.2) with app.processEvents() and it might just work.
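
The difference between sleeping and processing events can be illustrated without Qt at all. The following is a toy model in plain Python, not PySide code: ToyEventLoop, ToyView, and the pump_events and max_polls parameters are all hypothetical names invented for this sketch, and the queue only loosely mimics what Qt does internally.

```python
from collections import deque


class ToyEventLoop:
    """A toy stand-in for Qt's event queue (hypothetical, not a Qt API)."""

    def __init__(self):
        self._pending = deque()

    def post(self, callback):
        # Queue a callback, roughly as a finished load queues its signal.
        self._pending.append(callback)

    def process_events(self):
        # Run everything queued so far, loosely like processEvents().
        while self._pending:
            self._pending.popleft()()


class ToyView:
    """Mirrors the shape of the View class from the question."""

    def __init__(self, loop):
        self.loop = loop
        self.completed = True

    def set_completed(self):
        self.completed = True

    def set_url(self, url, pump_events=True, max_polls=5):
        self.completed = False
        # "Loading" only queues the completion callback; nothing runs yet.
        self.loop.post(self.set_completed)
        polls = 0
        # max_polls keeps the demo from spinning forever.
        while not self.completed and polls < max_polls:
            if pump_events:
                # Give the loop a chance to deliver the callback.
                self.loop.process_events()
            # With pump_events=False this loop only waits, like time.sleep().
            polls += 1
        return self.completed
```

With pump_events=False the completion callback stays queued and the flag never flips, which is what the time.sleep() loop in the question runs into. With pump_events=True the callback is delivered on the first pass. In the question's View.setUrl, the analogous change is replacing time.sleep(0.2) with app.processEvents().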
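
As for the simpler ways of downloading plain HTML mentioned above, one possibility is the standard library alone, with no browser engine involved. This is a minimal sketch assuming Python 3; fetch_html is a hypothetical helper name, and the result is the raw markup as served, not the document after WebKit has processed it.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the body at `url` as text, without rendering it."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset the response declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Something like fetch_html("http://www.google.com/") would then return the page source in a single blocking call, with no event loop to manage.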