javascript jquery html jsoup htmlunit

Simulating page click in HtmlUnit (2.33) gives invalid or illegal selector exception


First I should say that I don't know JavaScript well at all. I'm trying to simulate a click on a hyperlink on a Bloomberg search results page. I want to grab a list of news items (hyperlinks), then traverse the list and get each article's title and text. This is my code:

public List<String> getBloomNewsHtmlUnit() throws IOException {
    String searchString = "Apple";
    List<String> bloombergNewsAll = new ArrayList<>();

    WebClient webclient = new WebClient(BrowserVersion.BEST_SUPPORTED);

    HtmlPage mainpage = webclient.getPage("http://www.bloomberg.com/search?query=" + searchString);

    HtmlAnchor pageanchor = mainpage.getFirstByXPath("//*[@id=\"content\"]/div/section/section[2]/section[1]/div[2]/div[2]/article/div[1]/h1/a");

    webclient.waitForBackgroundJavaScript(50000);
    webclient.getOptions().setThrowExceptionOnScriptError(false);
    webclient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webclient.setCssErrorHandler(new SilentCssErrorHandler());

    mainpage = pageanchor.click();

    System.out.println("Main page: " + mainpage.asText());

    return bloombergNewsAll;
    //  return bloombergNewsAll;
}

This is the exception:

Sep 11, 2016 9:49:34 AM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: :x).] sourceName=[https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js] line=[153] lineSource=[null] lineOffset=[0]
Exception in thread "main" java.lang.RuntimeException: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:284)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:519)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:386)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:304)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:451)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:436)
at com.jsoup.test.BloombergTest.getBloomNewsHtmlUnit(BloombergTest.java:71)
at com.jsoup.test.BloombergTest.main(BloombergTest.java:37)
Caused by: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:921)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:803)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:779)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:975)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:352)
at com.gargoylesoftware.htmlunit.html.HtmlScript$2.execute(HtmlScript.java:238)
at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:277)
... 7 more
Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3915)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3899)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3924)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3940)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.undefCallError(ScriptRuntime.java:3956)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThisHelper(ScriptRuntime.java:2390)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThis(ScriptRuntime.java:2384)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1342)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413)
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:794)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:906)
... 15 more
Java Result: 1

Even if I execute only the first four lines of my code (without any reference to the HtmlAnchor), the same error comes up. I read a few bug reports about this error online, but none of the suggested solutions seems to work in my case:

htmlunit : An invalid or illegal selector was specified

In the Stack Overflow question above, the suggestion was to call waitForBackgroundJavaScript on the webclient. I applied it, but it did not solve the problem.

JavaScript Exception in HtmlUnit when clicking at google result page

In this question I tried to add:

JavaScriptEngine engine = webclient.getJavaScriptEngine();
engine.holdPosponedActions();

to the code, but the error was still there.

https://sourceforge.net/p/htmlunit/bugs/1744/

The bug report above suggests reassigning the main page to the result of the select query. In my case I tried reassigning the page to the result of a click() event, but my code doesn't get that far: it throws the error as soon as I try to load the HtmlPage.

https://sourceforge.net/p/htmlunit/bugs/1661/

This report suggests simply ignoring the warnings, but in my case I'm getting an exception (not just warnings), which prevents the desired output.

I first tried to do the scraping using Jsoup. This worked fine, but Jsoup returned some erroneous links in the middle of the article text that were not on the original page when I inspected it in Chrome. I suspect that a JS or Ajax call changed the page DOM. This is why I chose to use HtmlUnit.

Would appreciate any tips on what I'm doing wrong to get this error and how to correct it. Also, if anybody thinks that it is possible to use Jsoup only to achieve what I want please let me know (I just read that Jsoup doesn't support dynamic changes in the DOM so won't work on its own). Thanks in advance!


Solution

  • The exception doesn't necessarily mean that the resulting page is useless, though that may differ from case to case. You have to check the result for the content you are looking for.

    To reduce the error messages printed by the JavaScript engine, you can set:

    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
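
A possible refinement, not part of the original answer: java.util.logging documents that the LogManager may keep only a weak reference to a logger, so a level set through a throwaway getLogger(...) call can in principle be lost once that logger is garbage collected. The sketch below keeps a strong reference in a static field. It assumes HtmlUnit's log output ends up in java.util.logging, as the line above implies; the class and method names are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class QuietHtmlUnitLogging {
    // Strong reference: without it the configured logger could be garbage
    // collected and a later getLogger(...) call would return a fresh one
    // with the default level.
    private static final Logger HTMLUNIT_LOGGER = Logger.getLogger("com.gargoylesoftware");

    private QuietHtmlUnitLogging() {
    }

    /** Switches off java.util.logging output for everything under com.gargoylesoftware. */
    static Logger silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
        return HTMLUNIT_LOGGER;
    }
}
```

Calling QuietHtmlUnitLogging.silence() once before creating the WebClient would take the place of the one-line call above. Child loggers such as com.gargoylesoftware.htmlunit inherit the OFF level from their parent.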
    

    The following example selects the first headline, triggers the click event and grabs the resulting page. To verify that the link was followed, it prints the page title before and after the click:

    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
    
    final WebClient webClient = new WebClient(BrowserVersion.CHROME);
    
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setTimeout(10000);
    
    try {
        HtmlPage page = webClient.getPage("http://www.bloomberg.com/search?query=Apple");
    
        System.out.println(page.getTitleText());
    
        ScriptResult result = page.executeJavaScript("document.querySelector(\"#content > div > section > section.search-results__content > section.content-stories > div.search-result-items > div:nth-child(1) > article > div > h1 > a\").click()");
    
        page = (HtmlPage)result.getNewPage();
    
        System.out.println(page.getTitleText());
    
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        webClient.close();
    }
    

    Since the pages are not populated by JavaScript, you could also skip HtmlUnit altogether and use an HTML parser such as jsoup:

    News class

    class News{
        private String title;
        private String href;
        private String content="";
    
        public String getTitle() {
            return title;
        }
    
        public String getHref() {
            return href;
        }
    
        public String getContent() {
            return content;
        }
    
        public void setContent(String content) {
            this.content = content;
        }
    
        public News(String title, String href){
            this.title=title;
            this.href=href;
        }
    }
    

    Example code for grabbing news from the first two pages (adjustable through numberOfResultpages):

    List<News> bloombergNewsAll = new ArrayList<>();
    
    String searchString = "Apple";
    String searchUrl = "http://www.bloomberg.com/search?query=" + searchString + "&page=";
    int numberOfResultpages = 2;
    Document doc;
    
    // grab title and href
    for (int i = 1; i <= numberOfResultpages; i++) {
        try {
            doc = Jsoup.connect(searchUrl + i)
                    .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36")
                    .referrer("http://www.bloomberg.com/").get();
            Elements searchResults = doc.select("#content > div > section > section.search-results__content > section.content-stories > div.search-result-items > div > article > div > h1");
            if(searchResults.isEmpty()) break; // no more searchResults
    
            for (Element result : searchResults) {
                bloombergNewsAll.add(new News(result.text(), result.select("a").attr("href")));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    
    // grab content
    for (News news : bloombergNewsAll) {
        // video pages carry no article text, so skip them before making a request
        if(news.getHref().contains("bloomberg.com/news/videos")) continue;
    
        try {
            // href is private in News, so read it through the getter
            doc = Jsoup.connect(news.getHref())
                    .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36")
                    .referrer("http://www.bloomberg.com/search?query=" + searchString).get();
    
            // each section of the site keeps the article body in a different container
            if(news.getHref().contains("bloomberg.com/news/")){
                news.setContent(doc.select("div.article-body__content").text());
            }else if(news.getHref().contains("bloomberg.com/gadfly")){
                news.setContent(doc.select("#article > div.body_ZtDFu > div.container_1KxJx").text());
            }else if(news.getHref().contains("bloomberg.com/view")){
                news.setContent(doc.select("div._31WvjDF17ltgFb1fNB1WqY").text());
            }
    
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    
    // do something useful with your results
    for (News news : bloombergNewsAll) {
        System.out.println(news.getTitle() + "\n\t" + news.getHref() + "\n\t" + (news.getContent().length()>150 ? news.getContent().substring(0, 150) : news.getContent()));
    }
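
The if/else chain in the content loop above maps a URL fragment to a CSS selector. As a rough sketch of one way to keep that mapping in a single place, the lookup could be pulled into a small helper. The class name ArticleSelectors and the method selectorFor are hypothetical; the URL fragments and selectors are copied from the loop and will stop matching whenever the site changes its markup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class ArticleSelectors {
    // URL fragment -> CSS selector of the article body.
    // LinkedHashMap preserves insertion order, so the more specific
    // "news/videos" entry is tested before the general "news/" entry.
    private static final Map<String, String> SELECTORS = new LinkedHashMap<>();

    static {
        SELECTORS.put("bloomberg.com/news/videos", ""); // empty: nothing to extract
        SELECTORS.put("bloomberg.com/news/", "div.article-body__content");
        SELECTORS.put("bloomberg.com/gadfly", "#article > div.body_ZtDFu > div.container_1KxJx");
        SELECTORS.put("bloomberg.com/view", "div._31WvjDF17ltgFb1fNB1WqY");
    }

    private ArticleSelectors() {
    }

    /** Returns the selector for the first matching URL fragment, or empty if the page should be skipped. */
    static Optional<String> selectorFor(String href) {
        for (Map.Entry<String, String> entry : SELECTORS.entrySet()) {
            if (href.contains(entry.getKey())) {
                String selector = entry.getValue();
                return selector.isEmpty() ? Optional.empty() : Optional.of(selector);
            }
        }
        return Optional.empty(); // unknown section: no selector known
    }
}
```

With this in place, the loop could call selectorFor(news.getHref()) first, skip the request when the result is empty, and otherwise pass the selector to doc.select(...). Supporting another section of the site then means adding one map entry rather than another else-if branch.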