I was using a Python script that involved the str.upper() and str.lower() functions when I stumbled upon a strange result. When I passed the letter ᾉ (capital alpha with dasia and prosgegrammeni, U+1F89) to the upper() function, the result was ἉΙ instead of the expected ᾉ.
Code to reproduce:
print('ᾉ'.upper())
Prints
ἉΙ
Is this expected behavior or a bug of some sort?
Edit: I replaced the characters with the correct ones.
Inspecting the symbols (for example, using this online tool) tells me you have U+1F89 GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI (not U+1F88).
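If you'd rather not depend on an online tool, the same inspection can be done with Python's standard unicodedata module. This is just a minimal sketch; the variable names are my own, and the character is written as an escape so there is no doubt about which code point is being tested:

```python
import unicodedata

# The character from the question, written as an escape to avoid ambiguity.
ch = "\u1f89"

# Code point and official Unicode name of the input character.
code_point = f"U+{ord(ch):04X}"
name = unicodedata.name(ch)
print(code_point, name)

# Uppercasing yields two characters; list the name of each one.
upper = ch.upper()
upper_names = [unicodedata.name(c) for c in upper]
print(upper_names)
```

The second print shows that the result is a capital alpha with dasia followed by a separate capital iota.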
Looking up that term, we get a Wikipedia article on iota subscripts:
In uppercase-only environments, it is represented again either as slightly reduced iota (smaller than regular lowercase iota), or as a full-sized uppercase Iota.
You'd need someone with Ancient Greek knowledge to verify this, but at first glance, the result is logically equivalent to what you had initially.
Now, a careful reading of Section 3.13 of the Unicode standard reveals that your character belongs to a group that is explicitly singled out as an exception:
The invocations of canonical decomposition (NFD normalization) before case folding in D145 are to catch very infrequent edge cases. Normalization is not required before case folding, except for the character U+0345 COMBINING GREEK YPOGEGRAMMENI and any characters that have it as part of their canonical decomposition, such as U+1FC3 GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI.
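To check that U+1F89 really falls into that group, you can look at its canonical decomposition. A small sketch using unicodedata.normalize (variable names are mine, chosen for illustration):

```python
import unicodedata

ch = "\u1f89"  # GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI

# Fully decompose the character (NFD) and list the resulting code points.
nfd = unicodedata.normalize("NFD", ch)
nfd_code_points = [f"U+{ord(c):04X}" for c in nfd]
print(nfd_code_points)

# U+0345 COMBINING GREEK YPOGEGRAMMENI is part of the decomposition.
has_ypogegrammeni = "\u0345" in nfd
print(has_ypogegrammeni)
```

The decomposition is a plain capital alpha, a combining dasia, and U+0345, so the character is one of those that have the ypogegrammeni as part of their canonical decomposition.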
Moreover, as per Wikipedia,
For use in all-capitals ("uppercase"), Unicode additionally stipulates a special case-mapping rule according to which lowercase letters should be mapped to combinations of the uppercase letter and uppercase iota (ᾳ → ΑΙ). This rule not only replaces the representation of a monophthong with that of a diphthong, but it also destroys the reversibility of any capitalization process in digital environments, as the combination of uppercase letter and uppercase iota would normally be converted back to lowercase letter and lowercase iota.
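The loss of reversibility described in that quote is easy to see in Python. A short sketch, using the same ᾳ example as the Wikipedia passage (variable names are just for illustration):

```python
lower = "\u1fb3"  # ᾳ GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI

# Uppercasing expands the single character into capital alpha + capital iota.
upper = lower.upper()

# Lowercasing the result gives plain alpha + iota, not the original character.
round_trip = upper.lower()

print(repr(upper), repr(round_trip), round_trip == lower)
```

The round trip ends up as two ordinary lowercase letters, so the original ᾳ cannot be recovered from the uppercase form alone.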
Apparently you've hit a weird edge case in the Unicode standard, so this is expected behavior rather than a bug in Python's str.upper().