I'm trying to print the first 30 characters of some UTF-8 strings, and noticed that Java's String.substring()
is returning some funky strings. I've boiled it down to this:
I'm expecting "🤣" to be a String with length 1, and String.substring()
not to cut it in the middle. Why is my expectation not met? Java thinks it has length 2.
I'm pretty sure the UTF-8 encoding for 🤣 (U+1F923) "ROLLING ON THE FLOOR LAUGHING" is:
0xF0 0x9F 0xA4 0xA3
And so I expect this tiny program:
import java.nio.charset.StandardCharsets;

public class Foo {
    public static void main(String[] args) {
        String str = "🤣";
        // These are the UTF-8 bytes for "ROLLING ON THE FLOOR LAUGHING"
        byte[] raw = {(byte)0xf0, (byte)0x9f, (byte)0xa4, (byte)0xa3};
        String str2 = new String(raw, StandardCharsets.UTF_8);
        System.out.println(str.equals(str2));
        System.out.println(str.length());
        System.out.println(str.substring(0, 1));
    }
}
To print out:
true
1
🤣
But in fact it prints out:
true
2
?
Am I doing something wrong?
I've tried a custom Java 11.0.20.1 build and these standard Ubuntu packages, with the same results:
$ javac -version
javac 19.0.2
$ java -version
openjdk version "19.0.2" 2023-01-17
OpenJDK Runtime Environment (build 19.0.2+7-Ubuntu-0ubuntu322.04)
OpenJDK 64-Bit Server VM (build 19.0.2+7-Ubuntu-0ubuntu322.04, mixed mode, sharing)
python3 does what I expect:
$ python3 -c 'print(len("🤣"))'
1
$ python3 -c 'print("🤣"[0])'
🤣
Java stores strings internally as UTF-16 (or something close to it). 🤣 (U+1F923) lies outside the Basic Multilingual Plane, so in UTF-16 it takes two code units, a surrogate pair (0xD83E, 0xDD23). That's why its length is 2, and why the first UTF-16 code unit on its own doesn't make any sense.
Java's String methods such as length(), charAt(), and substring() work on individual UTF-16 code units (chars), not full Unicode code points.
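If you want the code-point-aware behaviour you were expecting, String has code-point oriented methods (codePointCount and offsetByCodePoints) that let you count and index by code points instead of chars. Here is a minimal sketch of that approach; the class name CodePointDemo and the variable n are just illustrative:

import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    public static void main(String[] args) {
        String str = "🤣";

        // length() counts UTF-16 code units: 2 (a surrogate pair)
        System.out.println(str.length());                        // 2
        // codePointCount() counts Unicode code points: 1
        System.out.println(str.codePointCount(0, str.length())); // 1

        // To take the first n code points without splitting a surrogate pair,
        // translate the code-point count into a char index first.
        int n = 1; // e.g. 30 in the original use case
        int codePoints = str.codePointCount(0, str.length());
        int endIndex = str.offsetByCodePoints(0, Math.min(n, codePoints));
        System.out.println(str.substring(0, endIndex));          // 🤣
    }
}

Note that even code points aren't always what a user perceives as one "character": emoji built from several code points (skin-tone modifiers, ZWJ sequences) would still be split by this approach, and grapheme-aware segmentation (e.g. via java.text.BreakIterator) would be needed to keep those intact.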