I'm trying to crawl a website using HtmlUnit. Whenever I run it, though, it only outputs the following error:
Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot read property "push" from undefined (https://www.kinoheld.de/dist/prod/0.4.7/widget.js#1)
Now I don't know much about JS, but I read that push is some kind of array operation. This seems standard to me, and I don't know why it would not be supported by HtmlUnit.
Here is the code I'm using so far:
public static void main(String[] args) throws IOException {
    WebClient web = new WebClient(BrowserVersion.FIREFOX_45);
    web.getOptions().setUseInsecureSSL(true);
    String url = "https://www.kinoheld.de/kino-muenchen/royal-filmpalast/vorstellung/280823/?mode=widget&showID=280828#panel-seats";
    web.getOptions().setThrowExceptionOnFailingStatusCode(false);
    web.waitForBackgroundJavaScript(9000);
    HtmlPage response = web.getPage(url);
    System.out.println(response.getTitleText());
}
What am I missing? Is there a way around this or a way to fix this? Thanks in advance!
I've encountered a similar problem before. It comes from HtmlUnit being designed as a test harness framework rather than a web scraping tool, so by default it throws an exception on any JavaScript error in the page. Are you running the latest version of HtmlUnit?
I was able to run your code by adding the setThrowExceptionOnScriptError(false) line (as mentioned in Coffee Converter's answer) and also adding
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
at the top of the method to disable the log dump. This yielded an output of:
Royal Filmpalast München München | kinoheld.de
Full code is as follows:
public static void main(String[] args) throws IOException {
    // Silence HtmlUnit's log dump.
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
    String url = "https://www.kinoheld.de/kino-muenchen/royal-filmpalast/vorstellung/280823/?mode=widget&showID=280828#panel-seats";
    webClient.getOptions().setUseInsecureSSL(true);
    // Keep going when a page script fails, e.g. the "push" TypeError.
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    HtmlPage response = webClient.getPage(url);
    // Wait up to 9 seconds for background JavaScript; this only has an effect after the page is loaded.
    webClient.waitForBackgroundJavaScript(9000);
    System.out.println(response.getTitleText());
}
This was run from the command line on Red Hat with HtmlUnit 2.2.1. Hope this helps.
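In case it is unclear why one setLevel call on "com.gargoylesoftware" silences the whole library: java.util.logging loggers form a hierarchy by dotted name, and a logger with no level of its own inherits the level of its nearest configured parent. The sketch below uses only the JDK to show that. The class name LoggerLevelDemo and the child logger name are made up for illustration, and it assumes HtmlUnit's logging ends up in java.util.logging (as far as I know it logs through Apache Commons Logging, which usually falls back to java.util.logging when no other backend is on the classpath).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggerLevelDemo {
    // Hold a strong reference: the LogManager keeps loggers only weakly,
    // so a logger configured and then dropped may lose its level after GC.
    private static final Logger PARENT = Logger.getLogger("com.gargoylesoftware");

    // Returns true if a child logger is silenced by turning its parent off.
    public static boolean childIsSilenced() {
        PARENT.setLevel(Level.OFF);
        // Hypothetical child name; any name under "com.gargoylesoftware." behaves the same.
        Logger child = Logger.getLogger("com.gargoylesoftware.htmlunit.WebClient");
        // With no level set on the child, it inherits OFF from the parent,
        // so even SEVERE messages are not loggable.
        return !child.isLoggable(Level.SEVERE);
    }

    public static void main(String[] args) {
        System.out.println(childIsSilenced());
    }
}
```

If you still want to see real failures, Level.SEVERE instead of Level.OFF on the parent logger would be a less drastic choice.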