java, c#, unicode, glyph, codepoint

How can I tell if a Unicode code point is one complete printable glyph (or grapheme cluster)?


Let's say there's a Unicode String object, and I want to print each Unicode character in that String one by one. In a simple test covering only a few languages, I could successfully do this by assuming that one code point is always the same as one glyph.

But I know this is not true in general, and that logic can easily produce unexpected results for some languages or scripts.

So my question is: is there any way to tell whether one Unicode code point is one complete printable glyph in Java or C#? If I have to write the code in C/C++, that's fine too.

I googled for hours, but all I found was about code units and code points. It's very easy to tell whether a code unit is part of a surrogate pair, but I found nothing about graphemes.

Could anyone point me in the right direction, please?


Solution

  • You're definitely right that a single glyph is often composed of more than one code point. For example, the letter é (e with acute accent) may be equivalently written \u00E9 or with a combining accent as \u0065\u0301. Unicode normalization cannot always merge things like this into one code point, especially if there are multiple combining characters. So you'll need to use some Unicode segmentation rules to identify the boundaries you want.

    What you are calling a "printable glyph" is called a user-perceived character or (extended) grapheme cluster. In Java, the way to iterate over these is with BreakIterator.getCharacterInstance(Locale):

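    import java.text.BreakIterator;
    import java.util.Locale;

    // Locale.getDefault() stands in for the placeholder locale; pass the locale of your text.
    BreakIterator boundary = BreakIterator.getCharacterInstance(Locale.getDefault());
    boundary.setText(yourString);
    // Each [start, end) pair delimits one user-perceived character.
    for (int start = boundary.first(), end = boundary.next();
            end != BreakIterator.DONE;
            start = end, end = boundary.next()) {
        String chunk = yourString.substring(start, end);  // one grapheme cluster
    }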
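
To make the points above concrete, here is a minimal, self-contained sketch. It compares the raw code point count, the count after NFC normalization, and the number of clusters that BreakIterator reports. The class name GraphemeDemo, the helper names, the sample string, and the choice of Locale.ROOT are assumptions made for illustration only. How clusters are split for more complex input (emoji sequences, for example) depends on the JDK version, so the sample sticks to plain combining marks.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of the original answer.
final class GraphemeDemo {

    /** Splits text into the user-perceived characters reported by BreakIterator. */
    static List<String> graphemes(String text) {
        // Locale.ROOT is an assumption; use the locale of your text if you know it.
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        for (int start = it.first(), end = it.next();
                end != BreakIterator.DONE;
                start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    /** Counts the code points that remain after NFC normalization. */
    static int nfcCodePoints(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // "e" + combining acute, "q" + two combining dots, then a plain "x".
        String sample = "e\u0301q\u0307\u0323x";
        System.out.println("code points: " + sample.codePointCount(0, sample.length()));
        System.out.println("after NFC:   " + nfcCodePoints(sample));
        System.out.println("graphemes:   " + graphemes(sample).size());
    }
}
```

The "e" plus accent pair has a precomposed form, so NFC merges it into one code point. The "q" with two combining dots has none, so normalization leaves it as three code points while BreakIterator still treats it as a single cluster. That gap is why counting code points, even after normalizing, is not a reliable substitute for segmentation.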