unicode, utf-16

UTF-16 strings: how are characters at or above U+10000 processed?


As we know, UTF-16 is variable-length: any character at or above U+10000 takes two 16-bit code units instead of one.

However, .NET, Java and Windows WCHAR strings are UTF-16, yet they are treated as if they were fixed-length. What happens if I use characters at or above U+10000?

And if they do handle such characters, how do they do it? For example, in .NET and Java a char is 16 bits, so a single char cannot hold a character at or above U+10000.

(.NET, Java and Windows are just examples. My question is really about how to process characters at or above U+10000 in general, but knowing how these platforms do it would help my understanding.)


Thanks to @dystroy, I now know how they process them. But there is one problem: if a string uses UTF-16 surrogates, a random-access operation such as str[3] becomes an O(N) algorithm, because any character can take either 2 or 4 bytes! How is this problem handled?


Solution

  • I answered the first part of the question in this QA: basically, some characters are simply spread over more than one Java char (a surrogate pair).
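To make "spread over more than one char" concrete, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. The class and method names (`SurrogatePairSketch`, `encode`) are made up for illustration; in real code `Character.toChars` already does this.

```java
import java.util.Arrays;

final class SurrogatePairSketch {
    // Hypothetical helper: splits a supplementary code point
    // (U+10000..U+10FFFF) into its two UTF-16 code units.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;                 // 20-bit value
        char high = (char) (0xD800 + (v >> 10));     // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));    // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F600);
        // prints "U+1F600 -> D83D DE00"
        System.out.printf("U+1F600 -> %04X %04X%n", (int) pair[0], (int) pair[1]);
        // Cross-check against the JDK's own conversion; prints "true"
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F600)));
    }
}
```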

    To answer the second part, about random access to Unicode code points such as str[3]: there is more than one method.

    • charAt ignores surrogates and simply returns the 16-bit char at the given index, which is fast and obvious
    • codePointAt returns a 32-bit int (but it still needs a char index, not a code point index)
    • codePointCount counts the code points in a range
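A small sketch of how these three methods differ on a string containing one character outside the BMP. The class name and the sample string are assumptions chosen for illustration; the `String` methods are the standard ones.

```java
final class CodePointAccessSketch {
    // 'a', then U+1F600 written as a surrogate pair, then 'b'
    static final String SAMPLE = "a\uD83D\uDE00b";

    public static void main(String[] args) {
        // length() counts chars (UTF-16 code units), not code points
        System.out.println(SAMPLE.length());                                  // prints 4
        // charAt(1) returns only the high surrogate
        System.out.println(Integer.toHexString(SAMPLE.charAt(1)));            // prints d83d
        // codePointAt(1) combines the pair starting at char index 1
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(1)));       // prints 1f600
        // codePointAt(2) lands in the middle of the pair: lone low surrogate
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(2)));       // prints de00
        // codePointCount walks the range and counts pairs once
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));        // prints 3
    }
}
```

Note that codePointAt does not protect you from passing an index that falls inside a pair; keeping char indexes aligned is the caller's job.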

    And yes, counting the code points is costly: it is basically O(N), because every char has to be inspected. Here's how it's done in Java:

    static int codePointCountImpl(char[] a, int offset, int count) {
        int endIndex = offset + count;
        int n = 0;
        for (int i = offset; i < endIndex; ) {
            n++;  // every iteration accounts for exactly one code point
            if (isHighSurrogate(a[i++])) {
                // a high surrogate followed by a low surrogate is one code point
                if (i < endIndex && isLowSurrogate(a[i])) {
                    i++;
                }
            }
        }
        return n;
    }
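The same linear scan is what a code-point-aware str[n] has to pay for. A minimal sketch, assuming you want the n-th code point counted from the start of the string; the class and method names are hypothetical, while `offsetByCodePoints` and `codePointAt` are standard `String` methods.

```java
final class NthCodePointSketch {
    // Hypothetical helper: returns the n-th code point (0-based) of s.
    // offsetByCodePoints walks the chars from index 0, so this is O(N).
    static int codePointAtLogicalIndex(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n); // code point index -> char index
        return s.codePointAt(charIndex);            // throws if n is past the end
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";
        // prints 1f600
        System.out.println(Integer.toHexString(codePointAtLogicalIndex(s, 1)));
        // prints b (charAt(2) would have returned a lone low surrogate instead)
        System.out.println((char) codePointAtLogicalIndex(s, 2));
    }
}
```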
    

    UTF-16 is a poor format for working with code points, especially once you leave the BMP. Most programs simply don't handle code points, which is why the format remains usable. Most String operations are fast because they don't deal with code points: all the standard APIs take char indexes as arguments and don't worry about which code points lie behind them.
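If a program really does need repeated random access by code point, one option (not something the standard String API does for you) is to pay the O(N) cost once and decode into an int[]. This is only a sketch and assumes Java 8 or later for `String.codePoints()`; the class and method names are made up for illustration.

```java
final class CodePointArraySketch {
    // Hypothetical helper: decodes the string once (O(N)) so that
    // later lookups by code point index are plain O(1) array reads.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        int[] cps = toCodePoints("a\uD83D\uDE00b");
        System.out.println(cps.length);                    // prints 3
        System.out.println(Integer.toHexString(cps[1]));   // prints 1f600
        // Converting back uses the String(int[], int, int) constructor
        String roundTrip = new String(cps, 0, cps.length);
        System.out.println(roundTrip.length());            // prints 4
    }
}
```

The trade-off is memory: the array uses 4 bytes per code point, which is exactly the cost UTF-16 avoids for text that stays inside the BMP.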