Search code examples
javaxmlcharacter-encodingxml-parsingxerces

Encoding Issue Causing MalformedByteSequenceException in Xerces UTF8Reader


I'm encountering com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException with an XML file. I stepped through the Xerces code with a debugger and narrowed down the area where this was ocurring. I was able to determine that by removing the "smart quote" characters in the document, the document becomes parseable.

The document came with no DTD. Notepad++ pegs it as "ANSI as UTF-8". Firefox pegs it as "Western". I recall from a not-so-breathtaking lecture in college that UTF-8 was designed to be backward-compatible with single-byte encoding systems. I also see that on this chart, the byte sequence e2 80 9d is, in fact, representative of a "RIGHT DOUBLE QUOTATION MARK", but even though I can't see an encoding problem, I'm thinking there is one.

The exception message I'm getting from Xerces is Invalid byte 3 of 3-byte UTF-8 sequence. It's getting thrown from the invalidByte(3, 3, b2) call on line 435 of UTF8Reader. When I try to fully understand the logic of this method, my brain begins to melt out of my ears a little so I could be missing something, but as I mentioned above byte 3 (0x90). at least of the sequence above, is valid according to the UTF-8 table.

Here is the segment of the file where the double quote occurs shown in a hex editor:XML segment

I have tried the following:

  • Forcing the String to be loaded using UTF-8 via Charset.forName("UTF-8")
  • Adding the DTD <?xml version="1.0" encoding="UTF-8"?>
  • Opening the file in Notepad++ and encoding it as UTF-8 through its UI
  • Various combinations of the above, sometimes repeatedly

The byte indicated as invalid seems to be 63 (0x3F?) eclipse screenshot

I've also tried adding this smart quote character to a document that was previously parseable. As expected, it makes the parser throw up the same exception.

Stack trace:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:435)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1426)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2815)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)

...

Update: I still need to find a way to safely convert this to a String. I've encoded the file as UTF-8 using Notepad++. The code below successfully loads the bytes into a String (I can see read the XML in the String when when debugging in Eclipse), but now I'm getting MalformedByteSequenceException with different parameters. This time, I can post both the code and XML I'm using:

File file = new File("ccd.xml");

byte[] ccdBytes = org.apache.commons.io.FileUtils.readFileToByteArray(file);
String ccdString = new String(ccdBytes, Charset.forName("UTF-8"));

CDAUtil.load(new ByteArrayInputStream(IOUtils.toByteArray(ccdString))); //method that's doing the parsing

Stack Trace:

Exception in thread "main" com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:557)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1426)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2815)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at org.openhealthtools.mdht.emf.runtime.resource.impl.FleXMLLoadImpl.load(FleXMLLoadImpl.java:55)
    at org.eclipse.emf.ecore.xmi.impl.XMLResourceImpl.doLoad(XMLResourceImpl.java:180)
    at org.eclipse.emf.ecore.resource.impl.ResourceImpl.load(ResourceImpl.java:1494)
    at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:268)
    at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:250)
    at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:238)

However,

CDAUtil.load(new FileInputStream(new File("ccd.xml")));

works


Solution

  • You did not told us how you pass the file to Xerces. You can do in a different ways with different results. You can read a more detailed explanation about xml encoding issues here

    I suggest you to do the following one:

    1. Open the file with notedpad++ and add <?xml version="1.0" encoding="UTF-8"?> as first line if is missing
    2. In Notepad++ do a convertion to UTF-8 (without bom) (it should be in the format menu but I'm using the italian version of notepad++ so I'm guessing about menu translation)
    3. Save the file
    4. In Java open it as an InputStream,i.e. pass to xml parser an InputStream and NOT as a Reader subclass

    This should solve the problem, if you can past your code that pass the file to parser it's easier to find the problem.

    The steps solve the problem because the first line in xml with the encoding declaration is considered by the parser only if you are using an InputStream i.e a stream of bytes. If you read a stream of bytes you need an encoding declaration in order to specify how to convert bytes into chars.

    If you are passing the a String the first line is useless because you are passing a stream of chars and there is no need of encoding.

    If you want to use a String you must read the file as a InputStream and convert to a Reader specifying a charset (some thing like InputStreamReader inputStreamReader= new InputStreamReader(xmlFileInputStream,"UTF-8");)

    My guess is that you were getting the error because you did not specify the charset and Java picked up the one of your os (Windows-1252).