
&#xD; and &#13; while reading and writing XML file


I receive XML input from a network API. When I save it as an XML file from the browser, it contains some excess &#xD; character references. The problem is that when I parse this XML data with StAX, process it, and write the result back in another XML format through DOM, the output contains &#13; instead.

All I want is to get rid of the excess &#xD; in the input and the &#13; in the output. I can't find a clear reason for them, nor a solution.

This is what I get in the input XML element value after saving to file:

<someNode>Today is a fine day.&#xD;
&#xD;
So does everyday.&#xD;
</someNode>

And after writing, the output:

<someNode>Today is a fine day.&#13;
&#13;
So does everyday.&#13;
</someNode>

The expected and required output:

<someNode>Today is a fine day.

So does everyday.
</someNode>

The new line in the Text value of the node is intentional and needs to be preserved as it is.

Simplified Code sample:

Reading stream from API:

// Get Input XML stream from API
URL apiURL = new URL(API_Url);
HttpsURLConnection httpsAPIURLConn;
httpsAPIURLConn = (HttpsURLConnection) apiURL.openConnection();
httpsAPIURLConn.setConnectTimeout(10000); // timeout
httpsAPIURLConn.setDoInput(true);
InputStream inStream = httpsAPIURLConn.getInputStream();

// Data stream okay, Start StaX XLIFF reader
XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
// This is to read entity referenced strings
xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, true);

// StaX StreamReader
XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(new BufferedInputStream(inStream), "UTF-8");

// Read and load XML data to in-memory database to filter and process

Writing the new XML structure to a file after filtering and processing the original data:

// After processing and writing new Element structure to org.w3c.dom.Document
// write the content into xml file
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer tr = transformerFactory.newTransformer();
tr.setOutputProperty(OutputKeys.INDENT, "yes");
tr.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
tr.setOutputProperty(OutputKeys.METHOD, "xml");
tr.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
tr.setOutputProperty(OutputKeys.STANDALONE, "no");

DOMSource source = new DOMSource(doc);
File file = new File(xmlFilePath);
// try-with-resources flushes and closes the writer even if transform() throws
try (Writer outputStream = new OutputStreamWriter(new FileOutputStream(file), "UTF-8")) {
    StreamResult result = new StreamResult(outputStream);
    tr.transform(source, result);
}

Not sure what exactly I missed. But any suggestions or help would be great.


Solution

  • The simplest solution (aside from hooking into the SAX event stream) is to write an XSLT stylesheet that does exactly what you need, and to use it as your transformer instead of the default identity transformer.

    See http://en.wikipedia.org/wiki/Identity_transform#Using_XSLT for suggestions.

    You then need to supply your own rule for transforming text nodes, where you delete ASCII 13 characters by translating them to the empty string. See https://stackoverflow.com/a/5084382/53897 for details.
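
    As a rough illustration of the approach described above, here is a minimal sketch in Java. It assumes an XSLT 1.0 stylesheet that combines the identity template with one extra rule for text nodes, which uses translate() to drop carriage returns (ASCII 13). The class name CarriageReturnStripper, the method stripCarriageReturns, and the inline sample document are hypothetical names chosen for this example; they are not part of the question's code. The sample reads from a string for brevity, but a DOMSource wrapping your Document could be passed in the same way.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not taken from the question's code.
public class CarriageReturnStripper {

    // Identity transform plus one override: every text node is copied
    // with carriage returns (U+000D) removed via translate().
    private static final String STRIP_CR_XSLT =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"text()\">"
        + "<xsl:value-of select=\"translate(., '&#xD;', '')\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Serializes the given source to a string, stripping carriage returns
    // from text nodes. Checked exceptions are wrapped to keep the sketch short.
    public static String stripCarriageReturns(Source source) {
        try {
            Transformer tr = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STRIP_CR_XSLT)));
            tr.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            tr.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            tr.transform(source, new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Sample input modeled on the question: each line ends with &#xD;
        String xml = "<someNode>Today is a fine day.&#xD;\n&#xD;\n"
                   + "So does everyday.&#xD;\n</someNode>";
        System.out.println(
            stripCarriageReturns(new StreamSource(new StringReader(xml))));
    }
}
```

    In the question's code, this would presumably mean creating the transformer with transformerFactory.newTransformer(new StreamSource(...)) for the stylesheet rather than the no-argument newTransformer(), and keeping the existing setOutputProperty calls. Line feeds are left alone, so the intentional line breaks inside the element should survive.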