How to obtain the HTML of a loaded page using WebKit in Java


My goal is to use Java to parse Airbnb listing pages, such as this one: https://www.airbnb.com/rooms/28149735

I first tried Jsoup, as follows:

String html = Jsoup.connect(webPage).get().html();

However, this does not work: Jsoup does not execute the page's scripts, so the HTML it returns does not match what I see when I inspect the loaded page in a browser such as Chrome or Firefox.

So I am now trying to use WebKit (through the ui4j library), with the following code:

// get the instance of the webkit
BrowserEngine browser = BrowserFactory.getWebKit();
Page page = browser.navigate("https://www.airbnb.com/rooms/28149735");
page.show();

String html = page.getDocument().getBody().getInnerHTML();

But this does not work either. The page loads properly (I can see it in the console logs, and a pop-up window displays it correctly), but once the page has loaded I cannot access its HTML: I get a NullPointerException (see the error log below).

When I ran the code in debug mode and inspected the page object, its document was null, which seems to be the cause of the error.

So my question is: what am I doing wrong and how can I get the html of the loaded page?

Thank you very much in advance!

PS: Here is the error:

Exception in thread "JavaFX Application Thread" io.webfolder.ui4j.api.util.Ui4jException: java.lang.NullPointerException
    at io.webfolder.ui4j.webkit.aspect.WebKitAspect$CallableExecutor.run(WebKitAspect.java:41)
    at com.sun.javafx.application.PlatformImpl.lambda$null$172(PlatformImpl.java:295)
    at java.security.AccessController.doPrivileged(Native Method)
    at com.sun.javafx.application.PlatformImpl.lambda$runLater$173(PlatformImpl.java:294)
    at com.sun.glass.ui.InvokeLaterDispatcher$Future.run$$$capture(InvokeLaterDispatcher.java:95)
    at com.sun.glass.ui.InvokeLaterDispatcher$Future.run(InvokeLaterDispatcher.java)
    at com.sun.glass.ui.gtk.GtkApplication._runLoop(Native Method)
    at com.sun.glass.ui.gtk.GtkApplication.lambda$null$48(GtkApplication.java:139)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at io.webfolder.ui4j.webkit.dom.WebKitDocument.getBody_aroundBody12(WebKitDocument.java:74)
    at io.webfolder.ui4j.webkit.dom.WebKitDocument$AjcClosure13.run(WebKitDocument.java:1)
    at io.webfolder.ui4j.internal.aspectj.runtime.reflect.JoinPointImpl.proceed(JoinPointImpl.java:149)
    at io.webfolder.ui4j.webkit.aspect.WebKitAspect$CallableExecutor.run(WebKitAspect.java:39)
    ... 8 more

Solution

  • Is there a specific reason that you're using WebKit? This can be done fairly easily in standard Java.

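    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class URLReader {
        public static void main(String[] args) throws Exception {
            URL oracle = new URL("http://www.oracle.com/");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(oracle.openStream()));

            // Print the response one line at a time
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
            in.close();
        }
    }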
    

    The code above is taken from the Oracle documentation.
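
Since the question asks for the HTML as a String rather than printed output, the same standard-Java approach can be sketched so that it collects the lines instead of printing them. This is only a minimal sketch: the class name `RawHtmlFetcher` and the methods `readAll` and `fetch` are made up for illustration, and UTF-8 is assumed as the page encoding. Note that it returns the markup exactly as the server sends it; no scripts are executed, so script-rendered content will be missing just as with Jsoup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here for illustration only.
public class RawHtmlFetcher {

    // Reads every line from the stream and joins the lines with '\n'.
    // Assumes the content is UTF-8 encoded (an assumption of this sketch).
    public static String readAll(InputStream stream) throws IOException {
        StringBuilder html = new StringBuilder();
        // try-with-resources closes the reader (and the stream) when done
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                html.append(inputLine).append('\n');
            }
        }
        return html.toString();
    }

    // Fetches the raw response body for the given address.
    // No JavaScript is run; this is the markup as served.
    public static String fetch(String address) throws IOException {
        return readAll(new URL(address).openStream());
    }

    public static void main(String[] args) throws IOException {
        String html = fetch("http://www.oracle.com/");
        System.out.println(html.length() + " characters received");
    }
}
```

Keeping `readAll` separate from `fetch` means the reading logic can be tried on any `InputStream`, for example a local file or an in-memory byte array, without a network connection.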