I have an org.w3c.dom.Document and want to serialize it with the function below, but I get a SAXException. How can I fix this?
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public static String serializeXmlDocument(Document document) throws Exception
{
    // set up an identity transformer (no stylesheet)
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer trans = transformerFactory.newTransformer();
    trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    trans.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    trans.setOutputProperty(OutputKeys.INDENT, "yes");
    DOMSource source = new DOMSource(document);
    // create string from xml tree
    StringWriter stringWriter = new StringWriter();
    StreamResult stringResult = new StreamResult(stringWriter);
    trans.transform(source, stringResult);
    return stringWriter.toString();
}
This results in the following error (the German messages translate to "I/O error" and "Invalid UTF-16 surrogate detected"):
2014-07-20 03:03:36,451 ERROR [XXX] XXX main job error:
javax.xml.transform.TransformerException: org.xml.sax.SAXException: E/A-Fehler
java.io.IOException: Ungültige UTF-16-Ersetzung festgestellt: d835 20 ?
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:758)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:359)
at mypackage.handler.XmlHandler.serializeXmlDocument(XmlHandler.java:226)
at mypackage.subpackage.buildSolrXml(MyJob.java:213)
at mypackage.subpackage.doJob(MyJob.java:113)
at mypackage.MyWorkstation.main(MyWorkstation.java:27)
Caused by: org.xml.sax.SAXException: E/A-Fehler
java.io.IOException: Ungültige UTF-16-Ersetzung festgestellt: d835 20 ?
at com.sun.org.apache.xml.internal.serializer.ToStream.cdata(ToStream.java:1290)
at com.sun.org.apache.xml.internal.serializer.ToStream.characters(ToStream.java:1395)
at com.sun.org.apache.xml.internal.serializer.ToUnknownStream.characters(ToUnknownStream.java:814)
at com.sun.org.apache.xml.internal.serializer.ToUnknownStream.characters(ToUnknownStream.java:348)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:122)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:230)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:230)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:230)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:136)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:98)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:702)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:746)
... 5 more
Caused by: java.io.IOException: Ungültige UTF-16-Ersetzung festgestellt: d835 20 ?
at com.sun.org.apache.xml.internal.serializer.ToStream.writeUTF16Surrogate(ToStream.java:973)
at com.sun.org.apache.xml.internal.serializer.ToStream.writeNormalizedChars(ToStream.java:1110)
at com.sun.org.apache.xml.internal.serializer.ToStream.cdata(ToStream.java:1267)
... 16 more
This is not always caused by invalid UTF-16 characters. If a multi-byte UTF-8/16/32 character crosses a 1024-byte boundary anywhere in the stream, the Xalan XSLTC processor splits the character into two pieces. That produces two invalid characters and, in most cases, the error above.
This is due to a Xalan bug (1024-byte buffers), which will be fixed in OpenJDK 12.
The simplest file that triggers this bug is:
<?xml version="1.0" ?><x>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx𝜃</x>
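To make the failure mode concrete, here is a minimal sketch of what "splitting the character" means on the Java side. The 𝜃 at the end of the file is U+1D703, which Java stores as the surrogate pair d835 df03. The class and method names are made up for illustration, and the 1024-char chunk size is only an assumption standing in for the internal buffer; the actual Xalan internals may differ.

```java
import java.util.Arrays;

public class SurrogateSplitDemo {

    /** Builds 1023 'x' chars followed by U+1D703, so the pair sits at indexes 1023 and 1024. */
    static char[] buildText() {
        char[] text = new char[1025];
        Arrays.fill(text, 0, 1023, 'x');
        Character.toChars(0x1D703, text, 1023); // writes the high and low surrogate
        return text;
    }

    /** True if a chunk ending at chunkEnd (exclusive) cuts a surrogate pair in half. */
    static boolean endsWithDanglingHighSurrogate(char[] text, int chunkEnd) {
        return chunkEnd > 0 && chunkEnd < text.length
                && Character.isHighSurrogate(text[chunkEnd - 1])
                && Character.isLowSurrogate(text[chunkEnd]);
    }

    public static void main(String[] args) {
        char[] text = buildText();
        // Prints the two UTF-16 code units of the character: d835 df03
        System.out.printf("%x %x%n", (int) text[1023], (int) text[1024]);
        // Assumed chunk size of 1024 chars: the first chunk ends on the high surrogate alone.
        System.out.println(endsWithDanglingHighSurrogate(text, 1024)); // true
    }
}
```

A serializer that receives the first chunk on its own sees d835 with no matching low surrogate, which matches the "d835 ..." pattern in the error message.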
Update (April 9, 2021): It looks like this was "fixed" in Java 8u251 or 8u222 and 11.0.7. However, while the error is avoided, it looks like the character in question is ignored by the parser.
Update (February 8, 2025): A clarification of the comments above. The parser exception in my answer was indeed resolved in JDK 12+ (and 8u251+, 11.0.7+). However, that error is distinct from the one in the original question, which occurs in the GregorSamsa translet (not in the parser) and still occurs in JDK 17 and JDK 23, which rely on the latest version of Xalan, 2.7.3 (2023). Although the 1024-byte buffer boundary issue was fixed in the parser, a different 1024-byte buffer boundary issue exists in the transformer code (deep within Xalan XSLTC, I presume). I managed to create a new minimal reproducible example (below) and filed a bug report against JDK 23 and JDK 17 with Oracle on February 8, 2025. Prior to JDK 12, this bug would have been nearly indistinguishable from the earlier bug.
XML file PMC6002666.xml:
<?xml version="1.0" ?><a>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx𝜃𝜀𝜀<b>𝜀</b>𝜀</a>
XSL file xml-to-text-test.xsl:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output encoding="UTF-8" method="text" />
<xsl:template match="/"><xsl:apply-templates select="node()" /></xsl:template>
</xsl:stylesheet>
Java method:
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import org.w3c.dom.Document;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

public void testTransform03b() throws Exception
{
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document xml = builder.parse(new FileInputStream("PMC6002666.xml"));
    Document xsl = builder.parse(new FileInputStream("xml-to-text-test.xsl"));
    Source xmlSource = new DOMSource(xml);
    Source xslSource = new DOMSource(xsl);
    Transformer transformer = TransformerFactory.newInstance().newTransformer(xslSource);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    Result result = new StreamResult(baos);
    transformer.transform(xmlSource, result);
}
Exception:
javax.xml.transform.TransformerException: com.sun.org.apache.xalan.internal.xsltc.TransletException: org.xml.sax.SAXException: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
org.xml.sax.SAXException: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:786)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:395)
at gov.doe.jgi.dce.core.util.XmlUtilsTest.testTransform03b(XmlUtilsTest.java:337)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
Caused by: com.sun.org.apache.xalan.internal.xsltc.TransletException: org.xml.sax.SAXException: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
org.xml.sax.SAXException: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.SAXImpl.characters(SAXImpl.java:1560)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.DOMAdapter.characters(DOMAdapter.java:326)
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.template$dot$0()
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.transform()
at java.xml/com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.transform(AbstractTranslet.java:627)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:782)
... 3 more
Caused by: org.xml.sax.SAXException: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
at java.xml/com.sun.org.apache.xml.internal.serializer.ToTextStream.characters(ToTextStream.java:234)
at java.xml/com.sun.org.apache.xml.internal.utils.FastStringBuffer.sendSAXcharacters(FastStringBuffer.java:987)
at java.xml/com.sun.org.apache.xml.internal.dtm.ref.sax2dtm.SAX2DTM2.dispatchCharactersEvents(SAX2DTM2.java:3024)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.SAXImpl.characters(SAXImpl.java:1558)
... 11 more
Caused by: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
at java.xml/com.sun.org.apache.xml.internal.serializer.ToStream.throwIOE(ToStream.java:1801)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToStream.writeUTF16Surrogate(ToStream.java:975)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToTextStream.writeNormalizedChars(ToTextStream.java:306)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToTextStream.characters(ToTextStream.java:226)
... 14 more
---------
com.sun.org.apache.xalan.internal.xsltc.TransletException: org.xml.sax.SAXException: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
org.xml.sax.SAXException: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.SAXImpl.characters(SAXImpl.java:1560)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.DOMAdapter.characters(DOMAdapter.java:326)
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.template$dot$0()
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.transform()
at java.xml/com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.transform(AbstractTranslet.java:627)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:782)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:395)
at gov.doe.jgi.dce.core.util.XmlUtilsTest.testTransform03b(XmlUtilsTest.java:337)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
Caused by: org.xml.sax.SAXException: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
at java.xml/com.sun.org.apache.xml.internal.serializer.ToTextStream.characters(ToTextStream.java:234)
at java.xml/com.sun.org.apache.xml.internal.utils.FastStringBuffer.sendSAXcharacters(FastStringBuffer.java:987)
at java.xml/com.sun.org.apache.xml.internal.dtm.ref.sax2dtm.SAX2DTM2.dispatchCharactersEvents(SAX2DTM2.java:3024)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.SAXImpl.characters(SAXImpl.java:1558)
... 11 more
Caused by: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
at java.xml/com.sun.org.apache.xml.internal.serializer.ToStream.throwIOE(ToStream.java:1801)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToStream.writeUTF16Surrogate(ToStream.java:975)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToTextStream.writeNormalizedChars(ToTextStream.java:306)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToTextStream.characters(ToTextStream.java:226)
... 14 more
---------
org.xml.sax.SAXException: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
at java.xml/com.sun.org.apache.xml.internal.serializer.ToTextStream.characters(ToTextStream.java:234)
at java.xml/com.sun.org.apache.xml.internal.utils.FastStringBuffer.sendSAXcharacters(FastStringBuffer.java:987)
at java.xml/com.sun.org.apache.xml.internal.dtm.ref.sax2dtm.SAX2DTM2.dispatchCharactersEvents(SAX2DTM2.java:3024)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.SAXImpl.characters(SAXImpl.java:1558)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.DOMAdapter.characters(DOMAdapter.java:326)
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.template$dot$0()
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.transform()
at java.xml/com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.transform(AbstractTranslet.java:627)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:782)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:395)
at gov.doe.jgi.dce.core.util.XmlUtilsTest.testTransform03b(XmlUtilsTest.java:337)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
Caused by: java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
at java.xml/com.sun.org.apache.xml.internal.serializer.ToStream.throwIOE(ToStream.java:1801)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToStream.writeUTF16Surrogate(ToStream.java:975)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToTextStream.writeNormalizedChars(ToTextStream.java:306)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToTextStream.characters(ToTextStream.java:226)
... 14 more
---------
java.io.IOException: Invalid UTF-16 surrogate detected: df00 d835 ?
at java.xml/com.sun.org.apache.xml.internal.serializer.ToStream.throwIOE(ToStream.java:1801)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToStream.writeUTF16Surrogate(ToStream.java:975)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToTextStream.writeNormalizedChars(ToTextStream.java:306)
at java.xml/com.sun.org.apache.xml.internal.serializer.ToTextStream.characters(ToTextStream.java:226)
at java.xml/com.sun.org.apache.xml.internal.utils.FastStringBuffer.sendSAXcharacters(FastStringBuffer.java:987)
at java.xml/com.sun.org.apache.xml.internal.dtm.ref.sax2dtm.SAX2DTM2.dispatchCharactersEvents(SAX2DTM2.java:3024)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.SAXImpl.characters(SAXImpl.java:1558)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.DOMAdapter.characters(DOMAdapter.java:326)
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.template$dot$0()
at jdk.translet/die.verwandlung.GregorSamsa.applyTemplates()
at jdk.translet/die.verwandlung.GregorSamsa.transform()
at java.xml/com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.transform(AbstractTranslet.java:627)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:782)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:395)
at gov.doe.jgi.dce.core.util.XmlUtilsTest.testTransform03b(XmlUtilsTest.java:337)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
A workaround for this (if you don't want to eliminate any valid UTF-8 characters, which is what the code in this answer does) would be to catch the TransformerException, inspect the message for "UTF-16", and programmatically modify the source XML document: pad the front of the text in question with a unique string of, say, 32 bytes, which shoves the problematic characters past the 1024-byte buffer boundary. After the transformation is complete, do a find-and-replace to remove the unique string you inserted. It is an awful workaround, I grant that, but it is the only way I can think of to preserve the integrity of the document content until this bug is fixed.
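A rough sketch of that padding workaround might look like the following. Everything here is an assumption for illustration: the class name, the PAD marker, and especially the heuristic of padding the first text node that contains a surrogate pair, since the exception does not say which node is "the text in question". It retries only once, so it may not help when several characters straddle buffer boundaries.

```java
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class PaddedTransform {

    // Hypothetical 32-char marker; pick any string that cannot occur in your data.
    static final String PAD = "@@PAD-3f9c1d7e5a2b4c6d8e0f1a2b@@";

    /** Walks the cause chain looking for the "UTF-16" surrogate message. */
    static boolean isSurrogateError(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (msg != null && msg.contains("UTF-16")) {
                return true;
            }
        }
        return false;
    }

    /** Prepends PAD to the first text node holding a surrogate pair; returns it, or null. */
    static Node padFirstSurrogateText(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String s = node.getNodeValue();
            if (s.codePointCount(0, s.length()) != s.length()) {
                node.setNodeValue(PAD + s);
                return node;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            Node hit = padFirstSurrogateText(c);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** Transforms doc; on a surrogate error, pads, retries once, and strips the marker. */
    static String transform(Transformer t, Document doc) throws TransformerException {
        try {
            return run(t, doc);
        } catch (TransformerException e) {
            if (!isSurrogateError(e)) {
                throw e;
            }
            Node padded = padFirstSurrogateText(doc);
            if (padded == null) {
                throw e;
            }
            try {
                return run(t, doc).replace(PAD, "");
            } finally {
                // Restore the caller's document whether or not the retry succeeded.
                padded.setNodeValue(padded.getNodeValue().substring(PAD.length()));
            }
        }
    }

    private static String run(Transformer t, Document doc) throws TransformerException {
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Matching on "UTF-16" also catches the localized German message from the question, which contains the same token. If the retry fails as well, the second exception propagates to the caller unchanged.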
I know it is small consolation 11 years after you posted this, but you definitely discovered something that slipped under the radar of a lot of developers at Sun Microsystems, Oracle, the OpenJDK team, and Xalan.