As we know, UTF-16 is variable-length: any character above U+10000 takes two 16-bit code units instead of one. However, .NET, Java, and Windows WCHAR strings treat UTF-16 as if it were fixed-length... What happens if I use a character above U+10000? And if these platforms do handle characters above U+10000, how do they process them? For example, in .NET and Java a char is 16 bits, so a single char cannot hold a character above U+10000.

(.NET, Java, and Windows are just examples; what I really want to know is how characters above U+10000 are processed, and seeing how these platforms do it would help my understanding.)
Thanks to @dystroy, I now know how they process them. But there is one problem: if a string contains UTF-16 surrogate pairs, a random-access operation such as str[3] is an O(N) algorithm, because any character can take either 2 or 4 bytes! How is this problem handled?
I answered the first part of the question in this Q&A: basically, some characters are simply spread over more than one Java char.
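As a quick illustration (a minimal sketch; U+1F600, an emoji outside the BMP, is just an arbitrary example character):

public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1F600 (GRINNING FACE) lies above U+FFFF, so UTF-16
        // encodes it as a surrogate pair: \uD83D \uDE00.
        String s = "\uD83D\uDE00";

        System.out.println(s.length());                      // 2 -> two chars
        System.out.println(s.codePointCount(0, s.length())); // 1 -> one code point
        System.out.println((int) s.charAt(0));               // 55357 (0xD83D), only half a character
        System.out.println(s.codePointAt(0));                // 128512 (0x1F600), the real code point
    }
}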
To answer the second part, about random access to Unicode code points as in str[3]: there is more than one relevant method; codePointCount, for example, counts the code points in a range of chars.

And yes, counting code points is costly, basically O(N). Here's how it's done in Java:
static int codePointCountImpl(char[] a, int offset, int count) {
    int endIndex = offset + count;
    int n = 0;
    for (int i = offset; i < endIndex; ) {
        n++;
        // A high surrogate followed by a low surrogate is one
        // code point encoded as two chars, so skip both.
        if (isHighSurrogate(a[i++])) {
            if (i < endIndex && isLowSurrogate(a[i])) {
                i++;
            }
        }
    }
    return n;
}
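If you need random access by code point index, the standard way (still a linear scan underneath, but done for you) is String.offsetByCodePoints, which converts a code point index into a char index. A minimal sketch:

String s = "a\uD83D\uDE00b";   // 'a', U+1F600, 'b': 4 chars but 3 code points

// Char index of the third code point (code point index 2)...
int charIndex = s.offsetByCodePoints(0, 2);  // scans over surrogate pairs -> 3
// ...then read the code point there.
int cp = s.codePointAt(charIndex);           // 0x62, 'b'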
UTF-16 is a bad format for dealing with code points, especially if you leave the BMP. Most programs simply don't handle code points, which is the reason this format is usable at all. Most String operations are fast precisely because they don't deal with code points: all the standard APIs take char indexes as arguments, without worrying about what kind of code points lie behind them.
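This is also where the pitfall lies. As a small sketch, substring and indexOf work on char indexes, so they can cut a surrogate pair in half:

String s = "\uD83D\uDE00!";             // U+1F600 followed by '!'

System.out.println(s.indexOf('!'));     // 2: a char index, not a code point index
System.out.println(s.substring(0, 1));  // a lone high surrogate: broken text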