
How can a 21 byte UTF-8 sequence come from just 5 characters?


After writing some basic code to count the number of characters in a String, I've found one example where the UTF-8 encoded output creates 21 bytes from a 5 "character" String.

Here's the output:

String ==¦ อภิชาติ ¦==
Code units 7
UTF8 Bytes 21
8859 Bytes 7
Characters 5

I understand that Java's internal representation of a char is 2 bytes (one UTF-16 code unit) and that some characters may require two code units (a surrogate pair) to represent them.

As UTF-8 doesn't use any more than 4 bytes per character, how is a byte[] length of more than 20 possible for a 5 character String?

Here's the source:

import java.io.UnsupportedEncodingException;

public class StringTest {

    public static void main(String[] args) {
        displayStringInfo("อภิชาติ");
    }

    public static void displayStringInfo(String s) {
        System.out.println("Code units " + s.length());     
        try {
            System.out.println("UTF8 Bytes " + s.getBytes("UTF-8").length);
        } catch (UnsupportedEncodingException e) {
            // not handled
        }
        System.out.println("Characters " + characterLength(s));
    }

    public static int characterLength(String s) {
        int count = 0;
        for(int i=0; i<s.length(); i++) {
            if(!isLeadingUnit(s.charAt(i)) && !isMark(s.charAt(i))) count++;
        }
        return count;
    }

    private static boolean isMark(char ch) {
        int type = Character.getType(ch);
        return (type == Character.NON_SPACING_MARK ||
               type == Character.ENCLOSING_MARK ||
               type == Character.COMBINING_SPACING_MARK);
    }

    private static boolean isLeadingUnit(char ch) {
        return Character.isHighSurrogate(ch);
    }
}

Solution

  • Your "5 character" string actually consists of 7 Unicode code points:

    • U+0E2D THAI CHARACTER O ANG
    • U+0E20 THAI CHARACTER PHO SAMPHAO
    • U+0E34 THAI CHARACTER SARA I
    • U+0E0A THAI CHARACTER CHO CHANG
    • U+0E32 THAI CHARACTER SARA AA
    • U+0E15 THAI CHARACTER TO TAO
    • U+0E34 THAI CHARACTER SARA I

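    All of them are in the U+0800 to U+FFFF range, which requires 3 bytes per code point in UTF-8, hence a total length of 7×3 = 21 bytes. Your count of 5 comes from the isMark check: both occurrences of U+0E34 SARA I are non-spacing marks, so they are skipped, leaving 7 − 2 = 5.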
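
    To see this for yourself, here is a minimal sketch that lists each code point of the string with its UTF-8 length. The class name CodePointBytes and the helpers utf8BytesFor and utf8Bytes are made up for this illustration, not part of your code or of the JDK, and the length calculation assumes a well-formed string (no unpaired surrogates). The string is written with \u escapes so the result doesn't depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

public class CodePointBytes {

    // UTF-8 length of a single code point, by range (hypothetical helper).
    static int utf8BytesFor(int codePoint) {
        if (codePoint < 0x80) return 1;      // U+0000..U+007F
        if (codePoint < 0x800) return 2;     // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;   // U+0800..U+FFFF
        return 4;                            // U+10000..U+10FFFF
    }

    // Sum of the per-code-point lengths; assumes no unpaired surrogates.
    static int utf8Bytes(String s) {
        return s.codePoints().map(CodePointBytes::utf8BytesFor).sum();
    }

    public static void main(String[] args) {
        // The string from the question, written as escapes.
        String s = "\u0E2D\u0E20\u0E34\u0E0A\u0E32\u0E15\u0E34";

        // One line per code point: number, Unicode name, UTF-8 length, mark flag.
        s.codePoints().forEach(cp -> System.out.printf(
                "U+%04X %-28s %d bytes%s%n",
                cp,
                Character.getName(cp),
                utf8BytesFor(cp),
                Character.getType(cp) == Character.NON_SPACING_MARK
                        ? " (non-spacing mark)" : ""));

        System.out.println("Code points " + s.codePointCount(0, s.length()));
        System.out.println("UTF8 Bytes  " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

    Using getBytes(StandardCharsets.UTF_8) instead of getBytes("UTF-8") also avoids the checked UnsupportedEncodingException.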