locale, unicode-normalization

Is an ASCII-only Unicode string always normalized?


Imagine a string consisting of the single ASCII character i (U+0069). Turkish and related writing systems also have a dotless ı (U+0131). Can Unicode normalization split U+0069 (i) into U+0131 U+0307 (ı̇)? Is normalization locale-dependent, so that the result might vary by environment?


Solution

  • The normalization forms defined by Unicode are not locale-specific; they have no input other than the sequence of code points to be normalized.

    The Unicode website has a user-friendly chart of all characters which differ between the standardized normalization forms.

    Unfortunately, it is grouped by script, not by block, so we can't quickly check all the characters in the "Basic Latin" block (which matches the 128 characters of ASCII).

    Searching for "0069" specifically, we see that it appears as the result of normalizing certain code points, either as part of a "decomposition" in NFD, or as a compatibility replacement in forms NFKC and NFKD. However, it doesn't appear in the input column, because it doesn't change when converted to any of the normalization forms.

    I have not checked the other Basic Latin characters, but would be extremely surprised if any of them normalize to anything other than themselves. So to answer your original question: yes, I believe a string that only uses code points U+0000 to U+007F (the code points inherited from the 7-bit ASCII standard) will not change in any of the normalization forms defined by Unicode.
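
The unchecked part of the answer above can be tested empirically. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `changed_by_normalization` is made up for illustration. The result reflects the Unicode data bundled with the Python build that runs it, so it is a spot check rather than a proof. Note that `unicodedata.normalize` takes only a form name and a string, with no locale argument, which fits the point that normalization is not locale-specific.

```python
import unicodedata

# The four normalization forms defined by Unicode.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def changed_by_normalization(text: str) -> list[str]:
    """Return the normalization forms under which `text` is altered (hypothetical helper)."""
    return [form for form in FORMS if unicodedata.normalize(form, text) != text]


# Check every code point in the Basic Latin block (U+0000 to U+007F) on its own.
ascii_changed = {
    f"U+{cp:04X}": forms
    for cp in range(0x80)
    if (forms := changed_by_normalization(chr(cp)))
}
print(ascii_changed)  # an empty dict means no ASCII code point is altered

# Also check all 128 code points together as a single string.
all_ascii = "".join(chr(cp) for cp in range(0x80))
print(changed_by_normalization(all_ascii))  # an empty list means the string is unchanged

# The specific case from the question: "i" (U+0069) on its own.
print(changed_by_normalization("i"))

# For contrast, a non-ASCII letter that does decompose: í (U+00ED) under NFD.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "\u00ed")])
```

If the first two results come back empty, no ASCII code point changed under any form in that Python build. This matches Unicode Standard Annex #15, which states that text consisting exclusively of ASCII characters is left unaffected by all of the normalization forms.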