Tags: unicode, normalization, unicode-normalization, case-folding

Normalization needed after case folding


Given an NFC-normalized string, if I apply full case folding to it, can I assume that the result is NFC-normalized too?

I don't understand what the Unicode standard is trying to tell me in this quote:

Normalization also interacts with case folding. For any string X, let Q(X) = NFC(toCasefold(NFD(X))). In other words, Q(X) is the result of normalizing X, then case folding the result, then putting the result into Normalization Form NFC format. Because of the way normalization and case folding are defined, Q(Q(X)) = Q(X). Repeatedly applying Q does not change the result; case folding is closed under canonical normalization for either Normalization Form NFC or NFD.


Solution

  • No. A string that is in NFC might no longer be in NFC after full case folding. An example is U+00DF (LATIN SMALL LETTER SHARP S) followed by U+0301 (COMBINING ACUTE ACCENT):

    X = U+00DF U+0301
    NFC(X) = U+00DF U+0301
    toCasefold(NFC(X)) = U+0073 U+0073 U+0301
    NFC(toCasefold(NFC(X))) = U+0073 U+015B
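
The example above can be checked with a short Python sketch. It assumes that Python's built-in `str.casefold()` performs Unicode full case folding (it maps U+00DF to "ss") and uses `unicodedata.normalize` for NFC and NFD. The helper names `nfc`, `nfd`, and `q` are made up for illustration; `q` is meant to mirror the Q(X) from the quoted passage.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return s in Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    """Return s in Normalization Form D."""
    return unicodedata.normalize("NFD", s)


def q(s: str) -> str:
    """Q(X) = NFC(toCasefold(NFD(X))), as in the quoted passage."""
    return nfc(nfd(s).casefold())


x = "\u00df\u0301"            # SHARP S + COMBINING ACUTE ACCENT

# X is already in NFC: there is no precomposed "sharp s with acute".
assert nfc(x) == x

folded = x.casefold()         # full case folding turns U+00DF into "ss"
renormalized = nfc(folded)    # the second "s" now composes with U+0301

print(ascii(folded))          # 'ss\u0301'
print(ascii(renormalized))    # 's\u015b'
print(folded == renormalized) # False: the folded string is not in NFC

# Applying Q a second time changes nothing for this input.
print(q(q(x)) == q(x))        # True
```

So, at least for this input, case folding alone breaks NFC, and it is the final NFC step inside Q that makes repeated application stable. If you need a case-folded string that is also in NFC, normalize again after folding.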