My goal is to use Java to parse Airbnb listing pages, such as this one: https://www.airbnb.com/rooms/28149735
I first tried Jsoup, as follows:
String html = Jsoup.connect(webPage).get().html();
However, this does not work: Jsoup does not execute the page's scripts, so the HTML it returns is not the rendered page I see when I inspect it in a browser such as Chrome or Firefox.
So I am now trying to use WebKit (through ui4j), with the following code:
// get the instance of the webkit
BrowserEngine browser = BrowserFactory.getWebKit();
Page page = browser.navigate("https://www.airbnb.com/rooms/28149735");
page.show();
String html = page.getDocument().getBody().getInnerHTML();
But this does not work either. The page loads properly (I can see it in the console logs, and a window pops up showing it), but once it is loaded I cannot access the HTML: I get a NullPointerException (see the error log below).
When I ran the code in debug mode and inspected the page object, its document was null, which seems to be the cause of the error.
So my question is: what am I doing wrong and how can I get the html of the loaded page?
Thank you very much in advance!
PS: Here is the error:
Exception in thread "JavaFX Application Thread" io.webfolder.ui4j.api.util.Ui4jException: java.lang.NullPointerException
at io.webfolder.ui4j.webkit.aspect.WebKitAspect$CallableExecutor.run(WebKitAspect.java:41)
at com.sun.javafx.application.PlatformImpl.lambda$null$172(PlatformImpl.java:295)
at java.security.AccessController.doPrivileged(Native Method)
at com.sun.javafx.application.PlatformImpl.lambda$runLater$173(PlatformImpl.java:294)
at com.sun.glass.ui.InvokeLaterDispatcher$Future.run$$$capture(InvokeLaterDispatcher.java:95)
at com.sun.glass.ui.InvokeLaterDispatcher$Future.run(InvokeLaterDispatcher.java)
at com.sun.glass.ui.gtk.GtkApplication._runLoop(Native Method)
at com.sun.glass.ui.gtk.GtkApplication.lambda$null$48(GtkApplication.java:139)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at io.webfolder.ui4j.webkit.dom.WebKitDocument.getBody_aroundBody12(WebKitDocument.java:74)
at io.webfolder.ui4j.webkit.dom.WebKitDocument$AjcClosure13.run(WebKitDocument.java:1)
at io.webfolder.ui4j.internal.aspectj.runtime.reflect.JoinPointImpl.proceed(JoinPointImpl.java:149)
at io.webfolder.ui4j.webkit.aspect.WebKitAspect$CallableExecutor.run(WebKitAspect.java:39)
... 8 more
Is there a specific reason you're using WebKit? Fetching a page can be done fairly easily in standard Java.
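import java.net.*;
import java.io.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        // Open a stream to the URL and print the page source line by line.
        URL oracle = new URL("http://www.oracle.com/");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(oracle.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);
        in.close();
    }
}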
The above was taken directly from the Oracle documentation.
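If you want the HTML in a String, as in the question, rather than printed to the console, the same approach can be wrapped in a small helper. This is only a sketch: the class name PageFetcher, the method name fetch, the User-Agent value and the UTF-8 charset are my own assumptions, not something from the Oracle example. Keep in mind that it returns the raw response sent by the server; no JavaScript is executed, so content that the page builds with scripts will not be in the result.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class PageFetcher {

    // Hypothetical helper: reads the whole response body into a String.
    // Assumes the content is UTF-8 encoded.
    public static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        // Some servers reject requests without a browser-like User-Agent;
        // the value here is an arbitrary placeholder.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");

        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PageFetcher <url>");
            return;
        }
        String html = fetch(new URL(args[0]));
        System.out.println(html);
    }
}
```

You would call it with something like String html = PageFetcher.fetch(new URL(webPage)); and then pass the String to whatever parser you use.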