c# string unicode icu grapheme

C#'s StringInfo and TextElementEnumerator can't recognize graphemes properly


In C#, the StringInfo and TextElementEnumerator classes provide methods and properties for working with text elements. The documentation defines a text element as follows:

The .NET Framework defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A text element can be any of the following:

So in .NET a text element is supposed to be a grapheme. I tested this with some Unicode characters myself, and it seemed to hold until I tried one Korean letter, "가".

As we all know, some Unicode characters consist of multiple code points, and text can also contain code point sequences that form a single character. That's why I'm using StringInfo and TextElementEnumerator instead of a plain String.

StringInfo and TextElementEnumerator correctly detected surrogate pairs. "\u0061\u0308", a character made up of multiple code points, was also recognized as one text element, just as expected. But "\u1100\u1161" was not recognized as a single text element.
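The test described above can be reproduced with something like the following minimal sketch. The class and method names (TextElementDemo, CountTextElements) are made up for illustration; only StringInfo.GetTextElementEnumerator and TextElementEnumerator.MoveNext come from the framework. The counts it prints depend on the runtime: the question reports that "\u1100\u1161" comes out as two text elements on .NET Framework and Mono, and other runtimes may behave differently.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: counts the text elements that
    // TextElementEnumerator reports for the given string.
    public static int CountTextElements(string s)
    {
        int count = 0;
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string[] samples =
        {
            "\u0061\u0308", // "a" followed by a combining diaeresis
            "\uD83D\uDE00", // one code point (U+1F600) stored as a surrogate pair
            "\u1100\u1161", // conjoining jamo: lead consonant + vowel
            "\uAC00"        // the precomposed Hangul syllable
        };

        foreach (string s in samples)
        {
            // Compare UTF-16 code units with the text element count.
            Console.WriteLine("{0} chars, {1} text elements",
                s.Length, CountTextElements(s));
        }
    }
}
```

Comparing the last two lines of output shows whether the runtime treats the jamo sequence and the precomposed syllable the same way.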

"\u1100" is a leading letter "ㄱ", and "\u1161" is a vowel letter "ㅏ". They can be individual characters and shown to the users just as I write here and you can see them now. But if they are used together, they are rendered as one character "가" instead of "ㄱㅏ".

There are two ways to represent the Korean character "가":

  1. Using a single code point, U+AC00, from the Hangul Syllables block.
  2. Using two code points, U+1100 and U+1161, from the Hangul Jamo block.

Most of the time the former is used. The latter is rare; to be honest, I can't imagine when it's used at all. Anyway, the first is a single precomposed letter, and the second is a sequence of a lead consonant and a vowel that is treated as one character. When rendered they look exactly the same, and the two are canonically equivalent. The following expression also returns true in C#:

"\u1100\u1161".Normalize() == "\uAC00"

I wonder why Normalize() works fine here when C# doesn't consider the sequence one complete text element. I thought it had something to do with my .NET version, but that turned out not to be the case. The same thing happens in Mono.

I tested this with ICU as well, and it correctly treated "\u1100\u1161" as one grapheme. I initially thought StringInfo and TextElementEnumerator could eliminate the need for ICU4C in some simple cases, so I'm very disappointed now.

Here's my question:

Am I doing something wrong here?

or

Is a text element in .NET not a user-perceived character, unlike a grapheme in ICU?


Solution

  • The basic issue here is that per the Korean standard KS X 1026, the two jamos U+1100 and U+1161 are distinct from their combined form U+AC00. In fact, this exact example is used in the official standard (see section 6.2).

    Long story short, Microsoft attempted to follow the standard but other operating systems and applications don't necessarily do so. Hence you can get "malformed" content from other software / platforms that appears to be parsed incorrectly on Windows / in .NET, even though it is parsed "correctly" on those platforms.

    You will either need to ensure your data is correctly formed in the first place (unlikely, given that the de facto practice is to ignore the official standard completely) or you will need to use ICU (or a similar library) to deal with these cases.
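For the specific sequence in the question, one possible way to get the data into a shape that StringInfo handles is to compose it to NFC before counting text elements. This is only a sketch under that assumption, not something the answer above prescribes. The class and method names (NormalizedTextElements, CountAfterNfc) are hypothetical. It relies on the observation from the question that "\u1100\u1161".Normalize() equals "\uAC00", so it can only help for jamo sequences that have a precomposed syllable.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class NormalizedTextElements
{
    // Hypothetical helper: composes the input to NFC first, then asks
    // StringInfo how many text elements the composed string contains.
    public static int CountAfterNfc(string s)
    {
        string composed = s.Normalize(NormalizationForm.FormC);
        return new StringInfo(composed).LengthInTextElements;
    }

    public static void Main()
    {
        string jamo = "\u1100\u1161"; // lead consonant + vowel

        // The comparison from the question, with the form made explicit.
        Console.WriteLine(jamo.Normalize(NormalizationForm.FormC) == "\uAC00");

        // Text element count once the sequence has been composed.
        Console.WriteLine(CountAfterNfc(jamo));
    }
}
```

Sequences that have no precomposed form are left unchanged by normalization, so for those ICU or a similar library is still the more dependable route.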