I'm trying to dump the whole contents of a certain site using HTMLUnit, but when I try to do this in a certain (rather intrincate) site, I get an empty file (not an empty file per se, but it has an empty head tag, an empty body tag and that's it).
And here's my code:
BufferedWriter writer = new BufferedWriter(new FileWriter(fullOutputPath));
HtmlPage page;
final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
webClient.setCssEnabled(false);
webClient.setPopupBlockerEnabled(true);
webClient.setRedirectEnabled(true);
webClient.setThrowExceptionOnScriptError(false);
webClient.setThrowExceptionOnFailingStatusCode(false);
webClient.setUseInsecureSSL(true);
webClient.setJavaScriptEnabled(true);
page = webClient.getPage(url);
dumpString += page.asXml();
writer.write(dumpString);
writer.close();
webClient.closeAllWindows();
Some people say that I need to introduce a pause in my code, since the page takes a while to load in Google Chrome, but I set long pauses and it doesn't work.
Thanks in advanced.
Just some ideas...
Retrieving that URL with wget
returns a non-trivial HTML file. Likewise running your code with webClient.setJavaScriptEnabled(false)
. So it's definitely something to do with the Javascript in the page.
With Javascript enabled, I see from the logs that a bunch of Javascript jobs are being queued up, and I get see corresponding errors like this:
EcmaError: lineNumber=[49] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js] message=[TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)
at
com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:601)
Maybe those jobs are meant to populate your HTML? So when they fail, the resulting HTML is empty?
The error looks strange, as HtmlUnit usually has no issues with JQuery. I suspect the issue is with the code calling that particular line of the JQuery library.