python unicode python-unicode unicode-normalization

Unicode normalization: dotless i + accent


Let's combine a regular i with a combining acute accent, and normalize the result (using Python's unicodedata.normalize):

>>> from unicodedata import normalize
>>> normalize("NFC", "i\N{COMBINING ACUTE ACCENT}").encode("ascii", "namereplace")
b'\\N{LATIN SMALL LETTER I WITH ACUTE}'

As expected: a small i with the dot swapped out for an acute accent, í.

Let's do the same with a dotless i:

>>> from unicodedata import normalize
>>> normalize("NFC", "\N{LATIN SMALL LETTER DOTLESS I}\N{COMBINING ACUTE ACCENT}").encode("ascii", "namereplace")
b'\\N{LATIN SMALL LETTER DOTLESS I}\\N{COMBINING ACUTE ACCENT}'

This time the sequence does not compose: the result is still the dotless i followed by the combining accent. Other implementations, e.g., this one, do the same.

Why not? Is this consistent with the Unicode standard?


Solution

  • From The Unicode Standard, Version 14.0, Diacritics on i and j (emphasis mine):

    A dotted (normal) i or j followed by some common nonspacing marks above loses the dot in rendering. Thus, in the word naïve, the ï could be spelled with i + diaeresis. A dotted-i is not equivalent to a Turkish dotless-i + overdot, nor are other cases of accented dotted-i equivalent to accented dotless-i (for example, i + ¨ ≠ ı + ¨).
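The non-equivalence described in the quote can be checked against the data that Python's unicodedata module exposes. The following is a minimal sketch, not an authoritative explanation of the normalization algorithm: it looks up the canonical decomposition recorded for í and compares the NFC results for the dotted and dotless base letters. The constant and variable names are chosen here for illustration only.

```python
from unicodedata import decomposition, normalize

# Illustrative names, not part of any API.
ACUTE = "\N{COMBINING ACUTE ACCENT}"
DOTTED_I = "i"
DOTLESS_I = "\N{LATIN SMALL LETTER DOTLESS I}"
I_ACUTE = "\N{LATIN SMALL LETTER I WITH ACUTE}"

# Canonical decomposition of í, as space-separated hex code points.
decomp = decomposition(I_ACUTE)

# The base letter named by that decomposition.
base = chr(int(decomp.split()[0], 16))

# NFC can only recompose sequences that are canonical decompositions
# of some precomposed character.
composes_dotted = normalize("NFC", DOTTED_I + ACUTE) == I_ACUTE
composes_dotless = normalize("NFC", DOTLESS_I + ACUTE) == I_ACUTE

# Going the other way, NFD of í yields the dotted i, never the dotless one.
nfd = normalize("NFD", I_ACUTE)

print(decomp, repr(base), composes_dotted, composes_dotless)
```

The decomposition of í names the ordinary dotted i (U+0069) as its base, so the dotless i (U+0131) followed by an acute accent has no precomposed form to map to, and NFC leaves it unchanged.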