Search code examples
javaxmlunicodetransform

javax.xml.transform.TransformerFactory Unicode issue- Java


We are not able transform Unicode Characters properly. We are giving input in XML format, when we try to transform we are not able to get back the original string.

This is the code i'm using,

StringCarrier OStringCarrier = new StringCarrier();
String SXmlFileData= "<export_candidate_response><criteria><output><lastname>Bhagavath</lastname><firstname>ガネーシュ</firstname></output></export_candidate_response>";

String SResult = "";
 try
    {
      TransformerFactory tFactory = TransformerFactory.newInstance();
      Transformer transformer = tFactory.newTransformer(new StreamSource(SXslFileName));
      transformer.setOutputProperty(OutputKeys.ENCODING, "UTF8");
      OutputStream xmlResult = (OutputStream)new ByteArrayOutputStream();
      StreamResult outResult = new StreamResult(xmlResult);
      transformer.transform(new StreamSource(
          new ByteArrayInputStream(SXmlFileData.getBytes("UTF8"))),outResult);

      SResult = outResult.getOutputStream().toString();

      }
catch (TransformerConfigurationException OException)
    {
        //Exception has been thrown
        OException.printStackTrace();
        return OStringCarrier;
    }
     catch (TransformerException OException)
    {
        //Exception has been thrown
        OException.printStackTrace();
        return OStringCarrier;
    }
    catch (Exception OException)
    {
        //Exception has been thrown
        OException.printStackTrace();
        return OStringCarrier;
    }

This is the output i'm getting ガãƒ?ーシュ in place of ガネーシュ


Solution

  • This is the output i'm getting ガãƒ?ーシュ in place of ガネーシュ

    That tells you that somewhere in this process, data in UTF-8 is being read by a piece of software that thinks it is reading Latin-1. What it doesn't tell you is where in the process this is happening. So you need to divide and conquer - you need to find the last point at which the data is correct.

    Start by establishing whether the problem is before the transformation or after it. That's very easy if you're using an XSLT 2.0 processor: you can use <xsl:message select="string-to-codepoints($in)"/> to see what string of characters the XSLT processor has been given. It's a bit trickier with a 1.0 processor, but you can use substring($in, $n, 1) to extract the nth character, and that should give you a clue.

    My suspicion is that it's the input. Firstly, putting non-ASCII characters in a Java string literal is always a bit dangerous, because the round trip to a source repository can easily corrupt the code if you're not very careful about everything being configured correctly. Secondly, if the string is correct, it would be much safer to read it using a StringReader, rather than converting it to a byte stream. Try:

    transformer.transform(new StreamSource(
              new StringReader(SXmlFileData)),outResult);