Keep numeric character entity characters such as `&#10; &#13;` when parsing XML in Java


I am parsing XML in Java that contains numeric character references and entity references such as (but not limited to) &#10;, &#13;, &lt; and &gt; (line feed, carriage return, < and >). While parsing, I append the text content of nodes to a StringBuilder so that I can later write it to a text file.

However, the parser resolves these references into the characters they stand for, so they turn into newlines/whitespace when I write the String to a file or print it.

How can I keep the original numeric character references when iterating over the nodes of an XML file in Java and storing their text content in a String?

Example demo XML file:

<?xml version="1.0" encoding="UTF-8"?>
<ABCD version="2">    
    <Field attributeWithChar="A string followed by special symbols &#13;  &#10;" />
</ABCD>

Example Java code. It loads the XML, iterates over the nodes and collects the text content of each element's attributes in a StringBuilder. After the iteration is over, it writes the contents of the StringBuilder to the console and also to a file, but the output no longer contains the &#10; and &#13; symbols.

What would be a way to keep these symbols when storing them to a String? Could you please help me? Thank you.

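import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class Demo {
    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
        Document document = documentBuilder.parse(new File("path/to/demo.xml"));
        StringBuilder sb = new StringBuilder();

        // Visit every element and collect the values of all its attributes.
        NodeList nodeList = document.getElementsByTagName("*");
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node node = nodeList.item(i);
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                NamedNodeMap nnp = node.getAttributes();
                for (int j = 0; j < nnp.getLength(); j++) {
                    sb.append(nnp.item(j).getTextContent());
                }
            }
        }
        System.out.println(sb.toString());

        // Write the collected text to a file as UTF-8.
        try (Writer writer = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("path/to/demo_output.xml"), StandardCharsets.UTF_8))) {
            writer.write(sb.toString());
        }
    }
}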

Solution

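  • You need to escape all the XML entity references before parsing the file into a Document. You do that by replacing every ampersand & with its own entity reference &amp;, so the parser sees &amp;#10; and resolves it to the literal text &#10; instead of a line feed. Note that this also leaves references such as &lt; and &gt; unresolved in the parsed text. Something like: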

    DocumentBuilder documentBuilder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();

    // Read the raw XML text so it can be modified before parsing.
    String xmlContents = new String(
            Files.readAllBytes(Paths.get("demo.xml")), StandardCharsets.UTF_8);

    // Escape every '&' (literal replace, no regex needed) before parsing.
    Document document = documentBuilder.parse(
            new InputSource(new StringReader(xmlContents.replace("&", "&amp;"))));
    

    Output (the leading 2 is the value of the version attribute of ABCD):

    2A string followed by special symbols &#13;  &#10;
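
For anyone who wants to try the approach without a file on disk, here is a minimal self-contained sketch of the same idea. The class name `KeepEntitiesDemo` and the helper `collectAttributeValues` are made up for illustration and are not part of any library. It parses the demo XML from a String, escapes every `&` first, and then collects attribute values the same way the question's loop does. Keep in mind that a blanket replace is only an assumption that works for input like the demo file: it would also alter ampersands inside CDATA sections and references to entities declared in a DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class KeepEntitiesDemo {

    // Hypothetical helper: escapes every '&' so the parser sees "&amp;#10;"
    // and returns the literal text "&#10;" instead of a line feed character.
    static String collectAttributeValues(String xml) throws Exception {
        String escaped = xml.replace("&", "&amp;");

        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document =
                builder.parse(new InputSource(new StringReader(escaped)));

        // Same traversal as in the question: every element, every attribute.
        StringBuilder sb = new StringBuilder();
        NodeList elements = document.getElementsByTagName("*");
        for (int i = 0; i < elements.getLength(); i++) {
            NamedNodeMap attributes = elements.item(i).getAttributes();
            for (int j = 0; j < attributes.getLength(); j++) {
                sb.append(attributes.item(j).getNodeValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // The demo XML from the question, inlined as a String.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<ABCD version=\"2\">\n"
                + "    <Field attributeWithChar=\"A string followed by special symbols &#13;  &#10;\" />\n"
                + "</ABCD>";
        System.out.println(collectAttributeValues(xml));
    }
}
```

The sketch uses `String.replace` rather than `replaceAll` because the search text is a plain literal, so no regular expression is involved.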