I am trying to scrape https://www.nba.com/standings#/ with HtmlUnit.
What I am trying to use is page.getByXPath("//caption[@class='standings__header']/span")
This should return "Eastern Conference" and "Western Conference", but instead it returns nothing. Is my XPath wrong?
Here is my code:
package Standings;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSpan;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Standings {
    private static final String baseUrl = "https://www.nba.com/standings#/";

    public static void main(String[] args) {
        WebClient client = new WebClient();
        client.getOptions().setJavaScriptEnabled(false);
        client.getOptions().setCssEnabled(false);
        client.getOptions().setUseInsecureSSL(true);
        String jsonString = "";
        ObjectMapper mapper = new ObjectMapper();
        try {
            HtmlPage page = client.getPage(baseUrl);
            System.out.println(page.asXml());
            // Keep the result so it can be inspected; it comes back empty here.
            List<?> headers = page.getByXPath("//caption[@class='standings__header']/span");
            System.out.println(headers);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I used this code to reproduce your problem:
public static void main(String[] args) throws IOException {
    final String url = "https://www.nba.com/standings#/";
    // try-with-resources closes the WebClient when done
    try (final WebClient webClient = new WebClient()) {
        // JavaScript stays enabled (the default); script errors are logged instead of thrown
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setUseInsecureSSL(true);
        HtmlPage page = webClient.getPage(url);
        // give background JavaScript up to 10 seconds to finish
        webClient.waitForBackgroundJavaScript(10000);
        System.out.println(page.asXml());
    }
}
When running this, I got a bunch of warnings and errors in the log.
(BTW: the page also produces many errors/warnings when running in real browsers. It seems the maintainers of the page have an interesting view on quality.)
I guess the problematic error is this one:
TypeError: Cannot modify readonly property: constructor. (https://www.nba.com/ng/game/main.js#1)
There is a known bug in the JavaScript support of HtmlUnit (https://sourceforge.net/p/htmlunit/bugs/1897/). Because the error is thrown from main.js, I guess it stops the processing of the page's JavaScript before the content you are looking for is generated.
So far I have found no time to fix this (it looks like this has to be fixed in Rhino), but it is on the list.
Have a look at https://twitter.com/HtmlUnit to stay informed about updates.
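To rule out the XPath itself, here is a minimal sketch that uses only the JDK's XML and XPath classes (not HtmlUnit) to evaluate the expression from the question against two hand-written snippets. Both snippets and the class name XPathCheck are made up for illustration; the real markup on nba.com may well differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathCheck {
    // The XPath from the question, unchanged.
    static final String XPATH = "//caption[@class='standings__header']/span";

    // Parses well-formed markup and returns the text of every node the XPath selects.
    static List<String> headers(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(XPATH, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page before any script has run: the tables do not exist yet.
        String shell = "<html><body><div id='app'/></body></html>";
        // Hypothetical page after rendering; structure invented for illustration.
        String rendered = "<html><body>"
                + "<table><caption class='standings__header'>"
                + "<span>Eastern Conference</span></caption></table>"
                + "<table><caption class='standings__header'>"
                + "<span>Western Conference</span></caption></table>"
                + "</body></html>";

        System.out.println(headers(shell));    // prints []
        System.out.println(headers(rendered)); // prints [Eastern Conference, Western Conference]
    }
}
```

If the elements are missing from the document, the expression returns an empty list, which is what happens when the page's JavaScript never gets to build the tables. Note that @class='standings__header' is an exact string comparison, so if the real caption carries additional classes, something like contains(@class, 'standings__header') would be needed instead.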