
Web Scraping with Java using HTMLUnit


I am trying to scrape https://www.nba.com/standings#/ with HtmlUnit.

I am using page.getByXPath("//caption[@class='standings__header']/span"), which should return "Eastern Conference" and "Western Conference", but it returns nothing. Is my XPath wrong?

Here is my code:

    package Standings;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlElement;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlSpan;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class Standings {
        private static final String baseUrl = "https://www.nba.com/standings#/";

        public static void main(String[] args) {
            WebClient client = new WebClient();
            client.getOptions().setJavaScriptEnabled(false);
            client.getOptions().setCssEnabled(false);
            client.getOptions().setUseInsecureSSL(true);
            String jsonString = "";
            ObjectMapper mapper = new ObjectMapper();

            try {
                HtmlPage page = client.getPage(baseUrl);
                System.out.println(page.asXml());

                // Expected: the two conference header spans; actual: an empty list.
                List<?> headers = page.getByXPath("//caption[@class='standings__header']/span");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Solution

  • I used this code to reproduce your problem:

    public static void main(String[] args) throws IOException {
        final String url = "https://www.nba.com/standings#/";
    
        try (final WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setUseInsecureSSL(true);
    
            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(10000);
    
            System.out.println(page.asXml());
        }
    }
    

    When I ran this, I got a lot of warnings and errors in the log.

    (BTW: the page also produces many errors and warnings in real browsers. It seems the maintainers of the page have an interesting view on quality.)

    I guess the problematic error is this one:

    TypeError: Cannot modify readonly property: constructor. (https://www.nba.com/ng/game/main.js#1)

    There is a known bug in the JavaScript support of HtmlUnit (https://sourceforge.net/p/htmlunit/bugs/1897/). Because the error is thrown from main.js, I guess it stops the processing of the page's JavaScript before the content you are looking for is generated.

    So far I have not found the time to fix this (it looks like it has to be fixed in Rhino), but it is on the list.

    Have a look at https://twitter.com/HtmlUnit to stay informed about updates.
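
To check the XPath expression on its own, independently of HtmlUnit, here is a minimal sketch that uses only the JDK's javax.xml.xpath package. The sample markup is an assumption about what the fully rendered page might contain (it is not copied from nba.com), and the class name XPathCheck and the helper match are hypothetical names used for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathCheck {
    // The expression from the question, unchanged.
    static final String EXPR = "//caption[@class='standings__header']/span";

    // Hypothetical markup: an assumption about the rendered page, not real nba.com output.
    static final String SAMPLE =
          "<div>"
        + "<table><caption class='standings__header'><span>Eastern Conference</span></caption></table>"
        + "<table><caption class='standings__header'><span>Western Conference</span></caption></table>"
        + "</div>";

    // Evaluates EXPR against a well-formed XML string and returns the text of each matched node.
    static List<String> match(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    EXPR, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Markup that contains the captions: both spans are matched.
        System.out.println(match(SAMPLE));
        // Markup without the captions (as when the page script never runs): empty list.
        System.out.println(match("<div><p>loading</p></div>"));
    }
}
```

If the expression matches markup like the sample but returns nothing for the real page, the likely cause is that the caption elements were never generated, not that the XPath is wrong.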