
How can I tell that the page has finished loading?


I'm playing with Chromium's headless web browser API. Based on the chrome_remote_shell source code, I came up with the following code:

#!/usr/bin/env python

import json
import requests
import pprint
import websocket

tablist = json.loads(requests.get("http://%s:%s/json" % ("localhost", 9222)).text)
print(tablist)
wsurl = tablist[0]['webSocketDebuggerUrl']
conn = websocket.create_connection(wsurl)
navcom = json.dumps({"id":0, "method":"Network.enable"})
conn.send(navcom)
navcom = json.dumps({"id":1, "method":"Page.navigate", "params":{"url":"https://news.ycombinator.com/"}})
conn.send(navcom)

while True:
    packet = json.loads(conn.recv())
    if 'method' in packet:
        print(packet['method'])
    else:
        print(packet)

Here's example output:

[{u'description': u'', u'title': u'Hacker News', u'url': u'https://news.ycombinator.com/', u'webSocketDebuggerUrl': u'ws://localhost:9222/devtools/page/7d03a57d-77a9-4ceb-b645-3b85461de5be', u'type': u'page', u'id': u'7d03a57d-77a9-4ceb-b645-3b85461de5be', u'devtoolsFrontendUrl': u'/devtools/inspector.html?ws=localhost:9222/devtools/page/7d03a57d-77a9-4ceb-b645-3b85461de5be'}]
{u'id': 0, u'result': {}}
Network.requestWillBeSent
{u'id': 1, u'result': {u'frameId': u'21045.1'}}
Network.responseReceived
Network.dataReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished

I noticed that I get a long stream of messages, the last of which is Network.loadingFinished, but I receive that event for multiple requestIds, so it does not tell me that the whole page is done. How can I modify my script so that it exits the loop once the page has fully loaded?


Solution

  • It turns out I should have also subscribed to page events via Page.enable:

    #!/usr/bin/env python

    import json
    import sys

    import requests
    import websocket

    # Usage: script.py URL (the page to load is taken from sys.argv[1]).
    # Ask the running browser (remote debugging port 9222) for its open tabs.
    tablist = requests.get("http://%s:%s/json" % ("localhost", 9222)).json()
    print(tablist)

    # Attach to the first tab over its DevTools websocket.
    wsurl = tablist[0]['webSocketDebuggerUrl']
    conn = websocket.create_connection(wsurl)

    # Subscribe to Network and Page events, then start the navigation.
    conn.send(json.dumps({"id": 0, "method": "Network.enable"}))
    conn.send(json.dumps({"id": 1, "method": "Page.enable"}))
    conn.send(json.dumps({"id": 2, "method": "Page.navigate",
                          "params": {"url": sys.argv[1]}}))

    # Print every message until the page's load event fires.
    while True:
        s = conn.recv()
        packet = json.loads(s)
        if packet.get('method') == 'Page.loadEventFired':
            break
        print(s)
    

    What we're doing here is enabling notifications for both the Page and Network domains, then navigating to the website and reading every message that arrives afterwards. Once we receive Page.loadEventFired, we can assume that the page has finished loading, so we exit the loop and carry out any actions that depend on this condition.
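
If you need this wait in more than one place, the loop above can be pulled into a small helper. The following is only a sketch under a few assumptions: the name `wait_for_event`, its `recv` parameter and the `timeout` value are my own and not part of any library. `recv` is any zero-argument callable that returns one JSON message as a string, so with the script above you would pass `conn.recv`. The demo at the bottom feeds it canned messages instead of a live browser.

```python
import json
import time


def wait_for_event(recv, event_name, timeout=30.0):
    """Read JSON messages from recv() until one has method == event_name.

    Returns (matching_packet, packets_seen_before_it).
    Hypothetical helper: recv is any callable returning one JSON string,
    e.g. conn.recv from the script above.
    """
    deadline = time.monotonic() + timeout
    seen = []
    # The deadline is only checked between messages; a recv() call that
    # blocks forever is not interrupted by this loop.
    while time.monotonic() < deadline:
        packet = json.loads(recv())
        if packet.get("method") == event_name:
            return packet, seen
        seen.append(packet)
    raise TimeoutError("no %s within %.1f seconds" % (event_name, timeout))


if __name__ == "__main__":
    # Canned messages standing in for a live DevTools websocket.
    canned = iter([
        json.dumps({"id": 0, "result": {}}),
        json.dumps({"method": "Network.requestWillBeSent", "params": {}}),
        json.dumps({"method": "Page.loadEventFired", "params": {}}),
    ])
    event, earlier = wait_for_event(lambda: next(canned), "Page.loadEventFired")
    print(event["method"], len(earlier))
```

Against a real browser the call would look like `wait_for_event(conn.recv, 'Page.loadEventFired')`, issued after Page.enable and Page.navigate have been sent. Returning the earlier packets keeps the command replies and Network events available in case you still want to inspect them.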