java, string, unicode, encoding

Java: How do I loop through characters in a string that have a surrogate pair and print them?


I tried the following to loop through the characters in my string and print them. All of them print fine except the Deseret Long I (𐐀). Is there another way to do this so that the 𐐀 is printed correctly? Here is my code:

package javaapplication13;
public class JavaApplication13 {
    public static void main(String[] args) {
        String s = "h𤍡y𐐀\u0500";
        System.out.println(s);
        final int length = s.length();
        for (int offset = 0; offset < length;) {
            final int codepoint = s.codePointAt(offset);
            System.out.println((char) (codepoint));
            offset += Character.charCount(codepoint);
        }
    }
}

The output looks like this (Netbeans):

run:
h𤍡y𐐀Ԁ
h
䍡
y
Ѐ
Ԁ
BUILD SUCCESSFUL (total time: 0 seconds)

Solution

  • Your problem is caused by the cast from int to char (4 bytes down to 2 bytes). For a supplementary character, the value in the codepoint variable does not fit in a single char: such a character is stored as a surrogate pair, which is literally a pair of chars. The cast keeps only the low 16 bits, which is why U+10400 (𐐀) comes out as U+0400 (Ѐ). I think the simplest way to print it is to use the String.substring() method, taking the chars from offset up to offset + Character.charCount(codepoint). Alternatively, you can convert the code point to a char array with char[] ch = Character.toChars(codepoint); and turn that array back into a string with new String(ch).
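Both suggestions above can be sketched roughly as follows. This is a minimal illustration rather than a definitive fix: the class name SurrogatePairDemo and the two helper methods are hypothetical names made up for this example, and the test string is written with \u escapes (the same characters as in the question) only so that the result does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class and method names, chosen for this sketch only.
class SurrogatePairDemo {

    // Variant 1: rebuild each code point as a String via Character.toChars.
    static List<String> codePointStrings(String s) {
        List<String> result = new ArrayList<>();
        for (int offset = 0; offset < s.length();) {
            int codepoint = s.codePointAt(offset);
            // toChars yields one char for a BMP code point and
            // two chars (a surrogate pair) for a supplementary one.
            result.add(new String(Character.toChars(codepoint)));
            offset += Character.charCount(codepoint);
        }
        return result;
    }

    // Variant 2: cut the one or two chars out of the original string.
    static List<String> codePointSubstrings(String s) {
        List<String> result = new ArrayList<>();
        for (int offset = 0; offset < s.length();) {
            int count = Character.charCount(s.codePointAt(offset));
            result.add(s.substring(offset, offset + count));
            offset += count;
        }
        return result;
    }

    public static void main(String[] args) {
        // h, U+24361, y, U+10400 (Deseret Long I), U+0500
        String s = "h\uD850\uDF61y\uD801\uDC00\u0500";
        for (String piece : codePointStrings(s)) {
            System.out.println(piece);
        }
    }
}
```

Each element of the returned list holds exactly one code point, so println receives the complete surrogate pair instead of a truncated char. Whether the console then displays the glyph still depends on its encoding and font.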