java jsoup htmlunit

Cannot download full Document using HtmlUnit and Jsoup combination (using Java)


Problem statement: I want to crawl this page: http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0

Let's say I want to parse the address, which is "24, Middle Gap Road, The Peak, Hong Kong".

What I did: I first tried loading the page with jsoup alone, but I noticed that the page takes some time to load. So I also plugged in HtmlUnit to wait for the page to finish loading first.

Code I wrote:

public static void parseByHtmlUnit() throws Exception{
        String url = "http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0";
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
        webClient.waitForBackgroundJavaScriptStartingBefore(30000);
        HtmlPage page = webClient.getPage(url);
        synchronized(page) {
            page.wait(30000);
        }
        try {
            Document doc = Jsoup.parse(page.asXml());
            String address = ElementsUtil.getTextOrEmpty(doc.select(".addr"));
            System.out.println("address"+address);
        } catch (Exception e) {
             e.printStackTrace();
        }
}

Expected output: the console should print: address 24, Middle Gap Road, The Peak, Hong Kong

Actual output: address


Solution

  • How about this?

    final Document document = Jsoup.parse(
        new URL("http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0"),
        30000
    );
    System.out.println(document.select(".addr").text());
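To try the selector without hitting the network, here is a minimal sketch that runs the same `.addr` query against an inline HTML string. The class name `AddrExtract`, the helper `extractAddress`, and the sample markup are assumptions made up for illustration; the real page's structure may differ. It relies on the address being present in the static HTML, which the snippet above also assumes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class AddrExtract {

    // Hypothetical helper: returns the combined text of every element
    // with class "addr", or an empty string when nothing matches.
    public static String extractAddress(String html) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(".addr");
        return matches.text();
    }

    public static void main(String[] args) {
        // Made-up markup standing in for the real page (an assumption).
        String html = "<html><body>"
                + "<div class=\"addr\">24, Middle Gap Road, The Peak, Hong Kong</div>"
                + "</body></html>";
        System.out.println("address " + extractAddress(html));
    }
}
```

If the address were injected by JavaScript instead, jsoup alone would not see it. In that case note that the code in the question calls its HtmlUnit wait before `getPage`, when no page has been loaded yet; `waitForBackgroundJavaScript` only helps when called after the page is loaded.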