java htmlunit

How to add some wait time between the page request and DOM response in HtmlUnit?


I'm trying to collect all the links on a web page (https://digital.utc.com/our-latest) using HtmlUnit, but it does not retrieve all of the links on the page.

I've tried adding a wait before HtmlUnit retrieves the DOM and stores it in the HtmlPage. I suspect that HtmlUnit retrieves the DOM and assigns it to the HtmlPage as soon as it connects to the page via "WebClient.getPage()", without leaving any time for the page to load its data from the database, but I can't find a way to add such a wait in HtmlUnit.

public void pageScrapping() throws FailingHttpStatusCodeException, MalformedURLException, IOException
    {
        //Initializing the WebClient 
        WebClient webClient = new WebClient();
        webClient.setThrowExceptionOnScriptError(false);
        webClient.setThrowExceptionOnFailingStatusCode(false);
        webClient.setCssEnabled(false);
        webClient.setJavaScriptEnabled(false);
        webClient.setTimeout(10000);

        HtmlPage page = webClient.getPage("https://digital.utc.com/our-latest");

        try 
        {
            Thread.sleep(3000);
        }

        catch (InterruptedException e) 
        {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        page = page.getPage();
        String htmlContent2 = page.asXml();
        File htmlFile2 = new File("Website2_XML.html");
        PrintWriter pw2 = new PrintWriter(htmlFile2);
        pw2.print(htmlContent2);
        pw2.close();

        System.out.println(page.getTitleText());

        DomNodeList<HtmlElement> links = (DomNodeList<HtmlElement>) page.getElementsByTagName("a");

        for (HtmlElement domElement : links) 
        {
            System.out.println(domElement.getAttribute("href"));
            System.out.println();
        }

    }
  • What I expected: HtmlUnit returns every link on the page that has an 'href' attribute.

  • What actually happens: some links are missing from HtmlUnit's result, even though the browser inspector shows them correctly.

Note: the missing links are on the right side of the page, in the list of articles that is loaded from the database.


Solution

  • The only links I see (using this code) without an href are anchors with an onClick handler. Can you please add more details about what is missing?

        final String url = "https://digital.utc.com/our-latest";
    
        try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(false);
    
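            // Load the page, then give background JavaScript jobs up to 4 seconds to finish.
            // Note: the wait only has work to do when JavaScript is enabled on the client.
            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(4_000);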
    
            System.out.println(page.asXml());
    
            DomNodeList<DomElement> links = page.getElementsByTagName("a");
            for (DomElement domElement : links)
            {
                String href = domElement.getAttribute("href");
                System.out.println(domElement.asXml());
            }
        }
    

    And, as always, make sure you are using the latest SNAPSHOT build.

    Update: I have made a small fix to the media query processing to avoid the NPE you are facing when running my code. Please use the latest SNAPSHOT build.
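
    If a fixed wait turns out to be too short or too long, one alternative is to poll for a condition instead of sleeping once. The sketch below is a generic, JDK-only helper and not part of HtmlUnit; the class name `PollingWait`, the method `waitUntil` and its parameters are made up for illustration. It checks a condition repeatedly until it becomes true or a timeout expires.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper (not an HtmlUnit API): polls a condition with a timeout.
class PollingWait {

    /**
     * Checks the condition immediately, then every pollMillis milliseconds,
     * until it returns true or timeoutMillis has elapsed.
     *
     * @return true if the condition was met, false if the timeout expired first
     */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Compare via subtraction so the check stays correct if nanoTime wraps around.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(pollMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in condition: becomes true roughly 200 ms after start.
        // In a scraper this would be a check against the loaded page instead.
        long start = System.currentTimeMillis();
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 2_000, 50);
        System.out.println("condition met: " + ready);
    }
}
```

    With HtmlUnit, the condition could be something like `() -> page.getElementsByTagName("a").size() >= expectedCount`, evaluated after `webClient.getPage(url)`, where `expectedCount` is a value you choose. This can only help when JavaScript is enabled on the WebClient, because otherwise nothing changes the DOM after the initial parse.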