python string non-ascii-characters

Python returns a length of 2 for a single non-ASCII character string


I am trying to get the span of selected words in a string. When working with the İ character, I noticed the following behavior of Python:

len("İ")
Out[39]: 1

len("İ".lower())
Out[40]: 2

# when `upper()` is applied, the length stays the same
len("İ".lower().upper())
Out[41]: 2

Why do the uppercase and lowercase forms of the same character have different lengths? That seems confusing and undesirable to me.

Does anyone know if there are other characters for which that will happen? Thank you!

EDIT:

On the other hand, for Î, for example, the length stays the same:

len('Î')
Out[42]: 1

len('Î'.lower())
Out[43]: 1

Solution

  • That's because the lowercase of 'İ' is 'i̇', which consists of 2 code points, and len() counts code points rather than visible characters:

    >>> import unicodedata
    >>> unicodedata.name('İ')
    'LATIN CAPITAL LETTER I WITH DOT ABOVE'
    >>> unicodedata.name('İ'.lower()[0])
    'LATIN SMALL LETTER I'
    >>> unicodedata.name('İ'.lower()[1])
    'COMBINING DOT ABOVE'
    

    The second code point is a combining dot, which your browser might render on top of the closing quote, so you may not be able to see it. If you copy-paste the string into your Python console, you should be able to see it.


    If you try:

    print('i̇'.upper())
    

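    you should get the following, which looks like the original character but still consists of 2 code points ('I' followed by the combining dot), which is why the length in the question stays at 2: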

    İ
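
As a rough sketch of the above (not part of the original answer), the snippet below spells out the code points with escapes so nothing depends on how the browser renders them, recombines the round-tripped string with NFC normalization, and scans for other characters whose case mapping changes length. The helper name `length_changing` is made up for illustration, and the scan results depend on the Unicode version bundled with your Python.

```python
import sys
import unicodedata

dotted = "\u0130"             # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = dotted.lower()      # 'i' followed by U+0307 COMBINING DOT ABOVE
round_trip = lowered.upper()  # 'I' followed by U+0307, still two code points

# NFC normalization recombines 'I' + U+0307 into the single code point U+0130.
recomposed = unicodedata.normalize("NFC", round_trip)

print([f"U+{ord(c):04X}" for c in lowered])
print(len(round_trip), len(recomposed))


def length_changing(method_name):
    """Return every single code point whose case mapping is not one code point long.

    method_name is the name of a str method such as "lower" or "upper".
    """
    result = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if len(getattr(ch, method_name)()) != 1:
            result.append(ch)
    return result


expanding_lower = length_changing("lower")
expanding_upper = length_changing("upper")
print(len(expanding_lower), len(expanding_upper))
```

The round-tripped string has length 2 and the NFC-normalized one has length 1. The uppercase scan includes, for example, 'ß', whose uppercase is 'SS'. If you need spans, one option is to compute the offsets on the exact string you search in (after any case conversion), since offsets from the original string may no longer line up.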