java eclipse oracle-database unicode xerces

Java Sun/Oracle xerces parser bug?

I'm using the com.sun.org.apache.xerces parser bundled in the rt.jar of JDK 1.8, with Eclipse Luna. I'm parsing an XML document in which an attribute and a CDATA section contain identical strings made up of Chinese characters (ideographs outside the Basic Multilingual Plane, so each one is a surrogate pair in a Java String), like this:

<tns:metaData tns:name="𪂂 - 𠮟 - 𪂂𠮟">
  <tns:metaValue><![CDATA[𪂂 - 𠮟 - 𪂂𠮟]]></tns:metaValue>
</tns:metaData>

After parsing, the attribute string looks like this:

𪂂 - 𪂂𠮟 - 𪂂𠮟𪂂𪂂𠮟𪂂𠮟

i.e., some of the characters (or pairs of characters) are duplicated, but the text from the CDATA looks good:

𪂂 - 𠮟 - 𪂂𠮟

Has anyone run across a similar issue? Any help would be appreciated.
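
For anyone who wants to try this, here is a minimal sketch of a reproduction. It assumes the document is parsed through the standard JAXP DOM API (the post does not say whether DOM or SAX was used), the namespace URI urn:example:tns is hypothetical because the tns declaration is not shown above, and the two code points U+2A082 and U+20B9F are assumed stand-ins for the characters in the sample. It only reports whether the attribute and the CDATA text come back unchanged.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SupplementaryAttrCheck {
    // Hypothetical namespace URI: the post does not show how tns is declared.
    public static final String NS = "urn:example:tns";

    // "A - B - AB" built from two CJK Extension B code points (assumed stand-ins
    // for the characters in the post); each is a surrogate pair in a Java String.
    public static final String TEXT = build();

    private static String build() {
        String a = new String(Character.toChars(0x2A082));
        String b = new String(Character.toChars(0x20B9F));
        return a + " - " + b + " - " + a + b;
    }

    // Same shape as the sample document: one attribute and one CDATA section
    // carrying the identical string.
    public static String sampleXml() {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<tns:metaData xmlns:tns=\"" + NS + "\" tns:name=\"" + TEXT + "\">\n"
             + "  <tns:metaValue><![CDATA[" + TEXT + "]]></tns:metaValue>\n"
             + "</tns:metaData>";
    }

    // Returns { attribute value, CDATA text } as seen by the default JAXP DOM parser.
    public static String[] parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        String attr = root.getAttributeNS(NS, "name");
        String cdata = root.getElementsByTagNameNS(NS, "metaValue").item(0).getTextContent();
        return new String[] { attr, cdata };
    }

    public static void main(String[] args) throws Exception {
        String[] result = parse(sampleXml());
        System.out.println("attribute matches: " + TEXT.equals(result[0]));
        System.out.println("CDATA matches:     " + TEXT.equals(result[1]));
    }
}
```

If the attribute line reports false while the CDATA line reports true, the parser on the classpath shows the behaviour described above.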

Solution

I guess the answer is "yes, it is a bug in the Sun/Oracle version of Xerces." I just tried this with the latest build from Apache and it works!

Note: if you're going to use the -Djava.endorsed.dirs="..." command-line switch, you'll need to add both the Xerces and the Xalan binaries, because Eclipse requires them. The property takes a single list of directories separated by the platform path separator, so on Windows it's something like:

 -Djava.endorsed.dirs="C:\Program Files (x86)\Java\xerces-2_11_0;C:\Program Files (x86)\Java\xalan-j_2_7_2"
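
To check whether the endorsed jars are actually being picked up, a small sketch like the one below prints which JAXP factory implementations are loaded. The class name ParserImplCheck and the helper isBundledJdkParser are made up for illustration. The parser bundled with the JDK lives under the com.sun.org.apache.xerces.internal package, while the Apache distribution uses org.apache.xerces.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

public class ParserImplCheck {
    // True when the class name belongs to the Xerces copy bundled with the JDK.
    public static boolean isBundledJdkParser(String className) {
        return className.startsWith("com.sun.org.apache.xerces.internal.");
    }

    public static void main(String[] args) {
        // Both factories are checked because either DOM or SAX may be in use.
        String dom = DocumentBuilderFactory.newInstance().getClass().getName();
        String sax = SAXParserFactory.newInstance().getClass().getName();
        System.out.println("DOM factory: " + dom + " (bundled: " + isBundledJdkParser(dom) + ")");
        System.out.println("SAX factory: " + sax + " (bundled: " + isBundledJdkParser(sax) + ")");
    }
}
```

Run it with the same JVM arguments as the application; if the output still shows the com.sun.org.apache.xerces.internal classes, the override has not taken effect.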

Cheers, Bob