In my application, I use LSSerializer to convert an XML document into a pretty-printed string:
public static String convertDocumentToString(Document doc) {
    DOMImplementationLS domImplementation = (DOMImplementationLS) doc.getImplementation();
    LSSerializer lsSerializer = domImplementation.createLSSerializer();
    // Set this to true if the output needs to be beautified.
    lsSerializer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
    return lsSerializer.writeToString(doc);
}
On one of my pages, I have the following pretty-printed XML string:
<result>
    <category catKey="school_level">
        <category catKey="primary">
            <category catKey="primary_1">
                <category catKey="math_primary_1"/>
                <category catKey="chinese_primary_1"/>
            </category>
            <category catKey="primary_2"/>
            <category catKey="primary_3"/>
        </category>
        <category catKey="jc"/>
    </category>
</result>
I use the following method to parse the above string:
public static Document parseXml(String xml)
        throws ParserConfigurationException, IOException, SAXException {
    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    docFactory.setNamespaceAware(false);
    docFactory.setValidating(false);
    docFactory.setFeature("http://xml.org/sax/features/namespaces", false);
    docFactory.setFeature("http://xml.org/sax/features/validation", false);
    docFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
    docFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
    Document doc = docBuilder.parse(new InputSource(new StringReader(xml)));
    return doc;
}
This is my test function:
public void test() throws Exception {
    Document doc = Test.parseXml("pretty-print-XML-string");
    // Iterate over the direct children of the root element of the parsed document.
    NodeList childList = doc.getDocumentElement().getChildNodes();
    for (int j = 0; j < childList.getLength(); j++) {
        System.out.println("TEST: " + childList.item(j));
    }
}
I expected to see only one category child node. However, the console showed the following lines:
INFO: TEST 2: [#text:
]
INFO: TEST 2: [category: null]
INFO: TEST 2: [#text:
]
INFO: TEST 2: [#text:
]
If I remove the line lsSerializer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE); from the convertDocumentToString function, those [#text:] nodes no longer appear.
I'd be very grateful if someone could explain why there are [#text:] nodes in the parsed document. I'd also appreciate advice on how I should parse a pretty-printed XML string.
To pretty-print the document, the serializer adds newlines and spaces between the elements you provided.
When you parse the pretty-printed XML, each of those runs of whitespace becomes an additional text node.
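If I recall correctly, you can tell the DocumentBuilderFactory to ignore such whitespace nodes with setIgnoringElementContentWhitespace(true). As far as I know, though, that setting relies on the content model, so it only takes effect when the parser validates against a DTD.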
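If no DTD is available, one alternative is to remove the whitespace-only text nodes yourself after parsing. The following is only a minimal sketch of that idea, not something taken from the question: the class name WhitespaceStripper and the helpers parse and stripWhitespaceNodes are made up for illustration, and it assumes the document has no mixed content in which whitespace-only text is significant.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, introduced only for this example.
public class WhitespaceStripper {

    // Recursively removes text nodes that contain nothing but whitespace.
    static void stripWhitespaceNodes(Node node) {
        NodeList children = node.getChildNodes();
        // Walk backwards so that removing a child does not shift
        // the indices of the children still to be visited.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripWhitespaceNodes(child);
            }
        }
    }

    // Parses an XML string with a default, non-validating DocumentBuilder.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // A small pretty-printed sample in the shape of the XML from the question.
        String xml = "<result>\n"
                + "  <category catKey=\"school_level\">\n"
                + "    <category catKey=\"jc\"/>\n"
                + "  </category>\n"
                + "</result>";
        Document doc = parse(xml);
        Node root = doc.getDocumentElement();
        System.out.println("children before: " + root.getChildNodes().getLength());
        stripWhitespaceNodes(root);
        System.out.println("children after: " + root.getChildNodes().getLength());
    }
}
```

If you would rather leave the document untouched, you can instead skip every child whose getNodeType() is not Node.ELEMENT_NODE inside your loop. That keeps the whitespace in place, so serializing the document again gives the same output.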