I have the following Java code:
@Test
public void notGettingNonBreakingSpace() throws ParserConfigurationException, IOException, SAXException, XPathExpressionException {
    DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
    // Do not fetch the external XHTML DTD while parsing
    documentBuilderFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
    String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \n" +
            "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n" +
            "<body><table><tr><td>&nbsp;</td></tr></table></body>\n" +
            "</html>";
    Document document = documentBuilder.parse(new ByteArrayInputStream(html.getBytes()));
    XPath xpath = XPathFactory.newInstance().newXPath();
    // Count the text nodes directly inside each table cell
    int result = ((NodeList) xpath.evaluate("//tr/td/text()", document, XPathConstants.NODESET)).getLength();
    assertEquals(1, result);
}
The assertion fails, as result is 0. If I take the HTML, however, save it as an .htm file, and open it in Chrome, $x("//tr/td/text()") in the Developer Tools Console returns as expected:
[text]
> 0: text
length: 1
> __proto__: Array(0)
What do I need to do to get the same result in Java, i.e. a node list with one item?
Is there an "ignore whitespace" setting on the DocumentBuilder or the XPath object somewhere, or is the root cause that Java and Chrome's JS engine disagree on how to handle that special whitespace character?
NB: Removing the text() (i.e. the text node selection) works; it then returns the right result. Replacing the non-breaking space (&nbsp;) with actual text (e.g. foo) also works...
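It looks like Java is not able to resolve &nbsp; when DTD loading is disabled: the entity is declared in the external XHTML DTD, which the parser no longer reads.
Your problem can be solved by declaring the &nbsp; entity yourself in the DOCTYPE's internal subset, like: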
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [ <!ENTITY nbsp "&#160;"> ]>
The evaluate call now returns one text node.
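For reference, here is a minimal, self-contained sketch of the fix applied to the code from the question. The class name NbspEntityDemo and the helpers buildXhtml and countTdTextNodes are made up for illustration; the parser feature, the XPath expression, and the entity declaration are the ones shown above. It assumes the default JDK parser, which is not namespace-aware unless configured otherwise.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class NbspEntityDemo {

    // Hypothetical helper: builds the XHTML from the question, with an
    // optional internal DTD subset appended inside the DOCTYPE.
    static String buildXhtml(String internalSubset) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"" + internalSubset + ">\n"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
                + "<body><table><tr><td>&nbsp;</td></tr></table></body>\n"
                + "</html>";
    }

    // Hypothetical helper: parses the document with external DTD loading
    // disabled and counts the text nodes matched by //tr/td/text().
    static int countTdTextNodes(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//tr/td/text()", document, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing or XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Declare nbsp locally so the parser can expand it without the external DTD.
        String withEntity = buildXhtml(" [ <!ENTITY nbsp \"&#160;\"> ]");
        System.out.println("text nodes with local entity: " + countTdTextNodes(withEntity));
    }
}
```

Passing the bytes with an explicit UTF-8 charset keeps them consistent with the encoding named in the XML declaration, which html.getBytes() only guarantees when the platform default happens to be UTF-8.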