
Java InputStream read locale dependent?


I have a client-server application. The client (a C++ application) sends UTF-8 encoded strings, and the server (a Java application) reads those strings over a socket connection. I am facing issues reading the strings on the server side when the server is hosted on Windows with the CP-1252 locale.

Here is pseudo-code

private transient Socket socket = null;
private transient InputStream in = null;
private transient OutputStream out = null;

socket = new Socket(server, port);
out = socket.getOutputStream();
in = socket.getInputStream();

The Socket and InputStream are initialized in a different function, and the actual string is read as shown below:

private String readString() throws IOException {
    byte[] backbytes = new byte[2048];
    StringBuilder sb = new StringBuilder();
    int total = 0;
    int c;

    while ((c = in.read(backbytes)) > 0) {
        if (debug)
            logger.trace("Read " + c + " bytes");
        total = total + c;

        char[] convertedChar = new char[c];
        int[] convertedInt = new int[c];
        for (int i = 0; i < c; i++) {
            convertedChar[i] = (char) backbytes[i];
            convertedInt[i] = (int) backbytes[i];
        }

        logFilePrint.print("Read string as : " + new String(backbytes, 0, c)
                + " and the converted char[] of byte[] is : ");
        printArray(logFilePrint, convertedChar);
        logFilePrint.print(" and converted int[] is : ");
        printArray(logFilePrint, convertedInt);
        logFilePrint.flush();

        sb.append(new String(backbytes, 0, c));
    }
    return sb.toString();
}

The issue happens for certain Unicode characters such as '私' or 'の'. If I execute the above code for these characters, I get output as

Read string as : ç§?ã? and the converted char[] of byte[] is : [, ￧, ᄃ, ?,  ̄, ?,] and converted int[] is : [, -25, -89, 63, -29, 63, -82,]

However, if I change the server's encoding by setting the JVM's default charset to UTF-8 with "-Dfile.encoding=UTF-8", I get the output:

Read string as : 私の and the converted char[] of byte[] is : [, ￧, ᄃ, チ,  ̄, チ, ᆴ] and converted int[] is : [, -25, -89, -127, -29, -127, -82,]

The issue in non-UTF-8 mode appears to occur for characters whose encoding contains the byte 0x81. For example, '私' has the UTF-8 encoding 0xE7 0xA7 0x81 and 'の' has the UTF-8 encoding 0xE3 0x81 0xAE.
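As a sanity check (standard library only), these byte values can be confirmed; note that Java bytes are signed, so 0xE7 prints as -25, 0x81 as -127, and so on, matching the int[] output above:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Bytes {
    public static void main(String[] args) {
        // UTF-8 bytes of '私': 0xE7 0xA7 0x81, i.e. -25 -89 -127 as signed bytes
        System.out.println(Arrays.toString("私".getBytes(StandardCharsets.UTF_8)));
        // UTF-8 bytes of 'の': 0xE3 0x81 0xAE, i.e. -29 -127 -82 as signed bytes
        System.out.println(Arrays.toString("の".getBytes(StandardCharsets.UTF_8)));
    }
}
```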

As far as I understand, in.read(backbytes) simply reads the raw bytes that were sent. Why should the bytes read be affected by whether the JVM's charset is UTF-8 or not? Is read() locale-dependent?


Solution

  • The constructor you chose, String(byte[] encoded, int offset, int length), uses the default platform encoding to convert bytes to characters. It explicitly depends on the environment in which it runs.

    This is a bad choice for portable code. For network applications, explicitly specify the encoding to be used. You can negotiate this as part of the network protocol, or specify a useful default like UTF-8.

    There are a variety of APIs that encode and decode text. For example, the String constructor String(byte[] encoded, int offset, int length, Charset encoding) can be used like this:

    String str = new String(backbytes, 0, c, StandardCharsets.UTF_8);
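For a socket stream, it can also be convenient to wrap the InputStream in an InputStreamReader with an explicit charset; the Reader then buffers a multi-byte UTF-8 sequence that happens to be split across two read() calls, which the byte-by-byte approach above would mangle. A minimal sketch (the helper name readAll is illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class Utf8SocketReader {
    // Decode the byte stream as UTF-8 regardless of the platform's
    // default charset; the Reader holds back an incomplete multi-byte
    // sequence until its remaining bytes arrive.
    static String readAll(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[2048];
        int n;
        while ((n = reader.read(buf)) > 0) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```

With this approach the decoding is done once, at the stream boundary, instead of at every String construction site.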