Consider the following:
public class Main {
    public static void main(String... strings) throws Exception {
        byte[] b = { -30, -128, -94 };

        // section utf-32
        String string1 = new String(b, "UTF-32");
        System.out.println(string1);              // prints ?
        printBytes(string1.getBytes("UTF-32"));   // prints 0 0 -1 -3
        printBytes(string1.getBytes());           // prints 63

        // section utf-8
        String string2 = new String(b, "UTF-8");
        System.out.println(string2);              // prints •
        printBytes(string2.getBytes("UTF-8"));    // prints -30 -128 -94
        printBytes(string2.getBytes());           // prints -107
    }

    // Prints each byte as a signed decimal value, separated by spaces.
    public static void printBytes(byte[] bytes) {
        for (byte b : bytes) {
            System.out.print(b + " ");
        }
        System.out.println();
    }
}
Output:
?
0 0 -1 -3
63
•
-30 -128 -94
-107
So I have two questions:
1. In both sections: why are the outputs of getBytes() and getBytes(charSet) different, even though I have specifically mentioned the string's charset?
2. Why are both of the byte outputs of getBytes in section utf-32 different from the actual byte[] b? (i.e. how can I convert a string back to its original byte array?)
Question 1:
In both sections: why are the outputs of getBytes() and getBytes(charSet) different, even though I have specifically mentioned the string's charset?
The character set you specify is used only while the conversion runs: in new String(b, charset) to decode the bytes into characters, and in getBytes(charset) to encode the characters back into bytes. It is not part of the String instance itself. You are not setting a character set for the string, and no character set is stored with it.
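A Java String does not keep its text in the byte encoding it was created from; it is a sequence of UTF-16 char values. If you call String.getBytes() without specifying a character set, it uses the platform default, e.g. Windows-1252 on many Windows machines (since JDK 18 the default is UTF-8 unless configured otherwise). That is where 63 and -107 come from: in Windows-1252 the bullet is the single byte 0x95 (-107), and the character behind string1 has no Windows-1252 mapping, so it is encoded as '?' (63).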
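As a minimal sketch of the point above, the following encodes the same one-character string with two explicitly named charsets and then with the platform default. The class name DefaultCharsetDemo and the encode helper are made up for illustration, and the sketch assumes the windows-1252 charset is available in the running JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the name and helper are illustrative only.
class DefaultCharsetDemo {

    // Encodes with an explicitly named charset, independent of the platform.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String bullet = "\u2022"; // the bullet character from the question

        // The String holds a single char; no charset is attached to it.
        System.out.println(bullet.length());

        // Same characters, different bytes, depending on the charset asked for.
        System.out.println(Arrays.toString(encode(bullet, StandardCharsets.UTF_8)));
        System.out.println(Arrays.toString(encode(bullet, Charset.forName("windows-1252"))));

        // getBytes() with no argument falls back to the platform default charset.
        System.out.println(Charset.defaultCharset());
        System.out.println(Arrays.toString(bullet.getBytes()));
    }
}
```

Passing the charset explicitly on both the decoding and the encoding side is what makes the result independent of the machine the code runs on.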
Question 2:
Why are both of the byte outputs of getBytes in section utf-32 different from the actual byte[] b? (i.e. how can I convert a string back to its original byte array?)
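You cannot always do this. Not every byte sequence is a valid encoding of characters. When such an array is decoded with new String(bytes, charset), the invalid sequences are not reported; they are silently replaced with the charset's replacement string (U+FFFD for the Unicode charsets), so the original bytes are lost.
This already happens during String string1 = new String(b,"UTF-32"); because three bytes cannot form a 4-byte UTF-32 code unit. The whole input becomes a single U+FFFD, and 0 0 -1 -3 is the UTF-32 encoding of that character (0x0000FFFD). In String string2 = new String(b,"UTF-8"); the three bytes happen to be the valid UTF-8 encoding of U+2022 (•), which is why getBytes("UTF-8") returns them unchanged.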
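The difference between the two sections can be reproduced with a small round trip. This is only a sketch: RoundTripDemo and roundTrip are hypothetical names, and the byte array is the one from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative, not from the original post.
class RoundTripDemo {

    // Decodes and re-encodes with the same charset, as the question does.
    static byte[] roundTrip(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        return decoded.getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] b = { -30, -128, -94 };
        Charset utf32 = Charset.forName("UTF-32");

        // Three bytes are not valid UTF-32, so decoding substitutes U+FFFD.
        String decoded = new String(b, utf32);
        System.out.println(decoded.equals("\uFFFD"));
        System.out.println(Arrays.toString(roundTrip(b, utf32)));

        // The same bytes are valid UTF-8, so they survive the round trip.
        System.out.println(Arrays.equals(b, roundTrip(b, StandardCharsets.UTF_8)));
    }
}
```

Once the substitution has happened, nothing in the String records what the original bytes were, so no choice of charset in getBytes can bring them back.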
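You can change this behavior by decoding with a CharsetDecoder, obtained from Charset.newDecoder(), instead of the String constructor. A decoder can be told to report malformed input (CodingErrorAction.REPORT), in which case decoding throws a CharacterCodingException rather than substituting characters.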
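A strict decode along those lines might look like the sketch below. StrictDecodeDemo, decodeStrict and isValid are hypothetical names chosen for this example; the CharsetDecoder calls themselves are standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the helper names are illustrative only.
class StrictDecodeDemo {

    // Decodes the bytes, throwing instead of substituting on invalid input.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience wrapper: true if the bytes are a valid encoding in the charset.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] b = { -30, -128, -94 };
        System.out.println(isValid(b, StandardCharsets.UTF_8));
        System.out.println(isValid(b, Charset.forName("UTF-32")));
    }
}
```

With this approach the caller finds out at decode time that the input is not text in the expected charset, rather than discovering a '?' later.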
If you want to store an arbitrary byte array in a String instance, use a hexadecimal or Base64 encoder. Do not use a character decoder for that.
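For illustration, here is one way to do that with java.util.Base64 and a hand-written hex helper. The class BinaryAsTextDemo and its method names are assumptions made for this sketch, not part of any library.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; the helper names are illustrative only.
class BinaryAsTextDemo {

    // Base64 text representation of arbitrary bytes.
    static String toBase64(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Recovers the exact original bytes from the Base64 text.
    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    // Lowercase hexadecimal representation, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] b = { -30, -128, -94 };
        String text = toBase64(b);
        System.out.println(text);
        System.out.println(toHex(b));
        System.out.println(Arrays.equals(b, fromBase64(text)));
    }
}
```

Unlike a charset decoder, these encodings are defined for every possible byte value, so the round trip back to the original array always succeeds.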