
Intricate sites using HtmlUnit


I'm trying to dump the whole contents of a site using HtmlUnit, but for one rather intricate site I get a nearly empty file: it contains an empty head tag, an empty body tag, and nothing else.

The site is https://www.abcdin.cl/abcdin/abcdin.nsf#https://www.abcdin.cl/abcdin/abcdin.nsf/linea?openpage&cat=Audio&cattxt=TV%20y%20Audio&catpos=03&linea=LCD&lineatxt=LCD%20&

And here's my code:

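import java.io.BufferedWriter;
import java.io.FileWriter;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// fullOutputPath, url and dumpString are declared elsewhere in my code.
BufferedWriter writer = new BufferedWriter(new FileWriter(fullOutputPath));
HtmlPage page;
final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
webClient.setCssEnabled(false);
webClient.setPopupBlockerEnabled(true);
webClient.setRedirectEnabled(true);
// Don't abort on script errors or on non-2xx responses.
webClient.setThrowExceptionOnScriptError(false);
webClient.setThrowExceptionOnFailingStatusCode(false);
webClient.setUseInsecureSSL(true);
webClient.setJavaScriptEnabled(true);
// Load the page and dump its DOM as XML.
page = webClient.getPage(url);
dumpString += page.asXml();
writer.write(dumpString);
writer.close();
webClient.closeAllWindows();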

Some people say that I need to add a pause to my code, since the page takes a while to load in Google Chrome, but I tried long pauses and it still doesn't work.

Thanks in advance.


Solution

  • Just some ideas...

    Retrieving that URL with wget returns a non-trivial HTML file, and so does running your code with webClient.setJavaScriptEnabled(false). So it's definitely something to do with the JavaScript in the page.

    With JavaScript enabled, I see from the logs that a bunch of JavaScript jobs are being queued up, and I see corresponding errors like this:

    EcmaError: lineNumber=[49] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js] message=[TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)]
    com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:601)
    

    Maybe those jobs are meant to populate your HTML? So when they fail, the resulting HTML is empty?

    The error looks strange, as HtmlUnit usually has no issues with jQuery. I suspect the issue is with the code calling that particular line of the jQuery library.
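
To narrow down which script and line the failures come from, it may help to tally the EcmaError lines in the log. This is only a rough sketch using plain Java, not anything HtmlUnit provides: the class name EcmaErrorTally and the method tally are made up for illustration, and the regular expression assumes every error line has exactly the field layout of the log excerpt above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of HtmlUnit): counts script errors per location.
final class EcmaErrorTally {
    // Assumes the "EcmaError: lineNumber=[..] ... name=[..] sourceName=[..]" layout
    // seen in the log excerpt; other layouts are simply skipped.
    private static final Pattern ECMA_ERROR = Pattern.compile(
        "EcmaError: lineNumber=\\[(\\d+)\\].*?name=\\[([^\\]]*)\\] sourceName=\\[([^\\]]*)\\]");

    /** Returns a count per "sourceName#lineNumber (errorName)" key, in first-seen order. */
    static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = ECMA_ERROR.matcher(line);
            if (m.find()) {
                String key = m.group(3) + "#" + m.group(1) + " (" + m.group(2) + ")";
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling EcmaErrorTally.tally on the lines of the log gives one entry per distinct script location. If nearly all errors point at the same jQuery line, that supports the idea that a single call site in the page's own scripts is passing jQuery an undefined value.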