Why does javax.xml.xpath's evaluate() method not return elements containing a non-breaking space when the selector uses the text() node test?

I have the following Java code:

    @Test
    public void notGettingNonBreakingSpace() throws ParserConfigurationException, IOException, SAXException, XPathExpressionException {
        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        documentBuilderFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

        DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

        String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \n" +
            "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n" +
            "<body><table><tr><td>&nbsp;</td></tr></table></body>\n" +
            "</html>";

        Document document = documentBuilder.parse(new ByteArrayInputStream(html.getBytes()));
        XPath xpath = XPathFactory.newInstance().newXPath();
        int result = ((NodeList) xpath.evaluate("//tr/td/text()", document, XPathConstants.NODESET)).getLength();
        assertEquals(1, result);
    }

The assertion fails, as result is 0. If I take the HTML, however, save it as an .htm file, and open it in Chrome, $x("//tr/td/text()") in the Developer Tools Console returns as expected:

[text]
> 0: text
  length: 1
> __proto__: Array(0)

What do I need to do to get the same result in Java, i.e. a node list with one item?

Is there an "ignore whitespace" setting on the DocumentBuilder or the XPath object somewhere, or is the root cause that Java and Chrome's JS engine disagree on how to handle that special whitespace character?

NB: Removing the text() (i.e. the text node selection) works; it then returns the right result. Replacing the non-breaking space (&nbsp;) with actual text (e.g. foo) also works.

Solution

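It looks like Java is not able to recognize &nbsp; when DTD loading is disabled. Unlike &amp; or &lt;, &nbsp; is not predefined in XML; it is declared in the XHTML DTD. With load-external-dtd set to false the parser never sees that declaration, so the reference expands to nothing and the td ends up with no text node.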

You can solve this by declaring the &nbsp; entity yourself in the internal subset of the DOCTYPE:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [ <!ENTITY nbsp "&#160;"> ]>

The evaluate() call then returns one text node.
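
To compare the two cases side by side, here is a minimal sketch that parses the same document with and without the internal entity declaration. It reuses the parser setup and XPath expression from the question. The class name NbspEntityDemo and the helpers xhtml() and countTdTextNodes() are made up for illustration. It also passes an explicit UTF-8 charset to getBytes(), which is an assumption on my part rather than something the question's code does.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical demo class; names are illustrative, not from the question.
final class NbspEntityDemo {

    // Builds the XHTML document from the question, optionally declaring
    // the nbsp entity in the DOCTYPE's internal subset.
    static String xhtml(boolean declareNbsp) {
        String internalSubset = declareNbsp ? " [ <!ENTITY nbsp \"&#160;\"> ]" : "";
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"" + internalSubset + ">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
            + "<body><table><tr><td>&nbsp;</td></tr></table></body>\n"
            + "</html>";
    }

    // Parses the XML with external DTD loading disabled (as in the question)
    // and counts the text nodes selected by //tr/td/text().
    static int countTdTextNodes(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//tr/td/text()", document, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Checked parser/XPath exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without declaration: " + countTdTextNodes(xhtml(false)));
        System.out.println("with declaration:    " + countTdTextNodes(xhtml(true)));
    }
}
```

The first call corresponds to the failing test in the question (which reports 0), the second to the DOCTYPE shown above. Declaring the entity inline keeps the parser offline; re-enabling load-external-dtd would presumably also resolve &nbsp;, at the cost of fetching the DTD from w3.org on every parse.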