Java Transformer converts Chinese character to ASCII value


OK, after a lot of searching I decided to ask here. Below is sample code that reproduces my problem. The Document object is built with a Chinese character in an attribute value.

String value= "𧀠";
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("value");      
root.setAttribute("attribute", value);
doc.appendChild(root);      
DOMSource source = new DOMSource(doc);  

I am trying to convert the document source to a string using the Transformer class, with the code below.

ByteArrayOutputStream outStream = null;
Transformer transformer = TransformerFactory.newInstance().newTransformer();
StreamResult htmlStreamResult = new StreamResult( new ByteArrayOutputStream() );        
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");        
transformer.transform(source, htmlStreamResult);                    
outStream = (ByteArrayOutputStream) htmlStreamResult.getOutputStream();
String outPut = outStream.toString( "UTF-8" );

But in the output I get, the Chinese character has been converted, as shown below.

<?xml version="1.0" encoding="UTF-8" standalone="no"?><value attribute="&#159776;"/>

I do not want the Chinese character to be converted; it should appear as it is. I would appreciate any help with this.
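A side note on the output above: `&#159776;` is an XML numeric character reference, not an ASCII value. The number is the decimal Unicode code point of the character, which lies outside the Basic Multilingual Plane and is therefore stored in a Java String as a surrogate pair. A minimal sketch of that relationship follows; the class and method names (`CharRefCheck`, `toDecimalReference`) are made up for illustration and the code point is taken from the output shown in the question.

```java
class CharRefCheck {
    /** Builds the XML decimal character reference for the first code point of s. */
    static String toDecimalReference(String s) {
        return "&#" + s.codePointAt(0) + ";";
    }

    public static void main(String[] args) {
        // 159776 is the number that appears in the serialized attribute value.
        String value = new String(Character.toChars(159776));

        System.out.println(Integer.toHexString(159776));              // 27020, i.e. U+27020
        System.out.println(value.length());                           // 2 (UTF-16 code units)
        System.out.println(value.codePointCount(0, value.length()));  // 1 (code point)
        System.out.println(toDecimalReference(value));                // &#159776;
    }
}
```

An XML parser resolves such a reference back to the original character, so the two forms carry the same attribute value even though they look different in the serialized text.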


Solution

  • Change UTF-8 to UTF-16. Because the result ends up in a Java String, which holds decoded characters rather than bytes in a particular encoding, this has no ill effect on the text itself. It does, however, change the encoding named in the XML declaration, and the serialized bytes may start with a BOM (byte order mark). You can optionally omit the declaration and attach your own.

        String value= "𧀠かな〜"; // (I don't see your character so I added some of my own)
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("value");
        root.setAttribute("attribute", value);
        doc.appendChild(root);
        DOMSource source = new DOMSource(doc);
    
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
        // transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes"); // optional
        transformer.transform(source, new StreamResult(outStream));
        // Decode the bytes with the same charset that was requested above.
        String outPut = outStream.toString("UTF-16");
        System.out.println(outPut);
    

    Output:

    <?xml version="1.0" encoding="UTF-16" standalone="no"?><value attribute="𧀠かな〜"/>
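The "leave the header out and attach your own" option mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name `OwnHeaderSketch`, the helper methods and the `HEADER` text are hypothetical, and the header claims UTF-8 on the assumption that the caller later writes the String out as UTF-8 bytes. The read-back step is there to confirm that the attribute value survives the round trip, whether or not a given JDK's serializer writes the character literally.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class OwnHeaderSketch {
    // Hypothetical header; assumes the String is later written out as UTF-8 bytes.
    static final String HEADER = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";

    /** Serializes <value attribute="..."/> as UTF-16 without a declaration, then prepends HEADER. */
    static String serialize(String attributeValue) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element root = doc.createElement("value");
            root.setAttribute("attribute", attributeValue);
            doc.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            // Java's UTF-16 decoder consumes a leading BOM if one was written.
            String body = new String(out.toByteArray(), StandardCharsets.UTF_16);
            return HEADER + body;
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    /** Parses the XML text and returns the root element's "attribute" value. */
    static String readAttribute(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The declared encoding is not used for decoding when parsing from a character stream.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("attribute");
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // U+27020 followed by three BMP characters, written as escapes to avoid source-encoding issues.
        String value = new String(Character.toChars(0x27020)) + "\u304B\u306A\u301C";
        String xml = serialize(value);
        System.out.println(xml);
        System.out.println(readAttribute(xml).equals(value)); // true
    }
}
```

If the result is written to a file or socket, the bytes should be produced with the charset the header names (here `StandardCharsets.UTF_8`); otherwise the declaration and the actual encoding disagree.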