
How do I get the "visible" length of a combining Unicode string in Python?


If I have a Python Unicode string that contains combining characters, len reports a value that does not correspond to the number of characters "seen".

For example, if I have a string with combining overlines and underlines such as u'A\u0332\u0305BC', len(u'A\u0332\u0305BC') reports 5; but the displayed string is only 3 characters long.

How do I get the "visible" length of a Unicode string containing combining characters in Python, that is, the number of distinct positions occupied by the string as the user sees it?


Solution

  • The unicodedata module has a function, combining, that returns the canonical combining class of a single character. If it returns 0, the character has no combining class and you can count it as non-combining.

    import unicodedata
    len(u''.join(ch for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0))
    

    or, slightly simpler:

    sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)
    

    Edit: as pointed out in the comments, there are code points other than those with a non-zero combining class that modify a character without being a character themselves, and they should not be included in the count either. Here's a more robust version of the above that checks the Unicode general category instead:

    modifier_categories = set(['Mc', 'Mn'])
    sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.category(ch) not in modifier_categories)
    

    We can use another Python trick to make that even simpler, taking advantage of True==1 and False==0:

    sum(unicodedata.category(ch) not in modifier_categories for ch in u'A\u0332\u0305BC')
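
The category-based approach above can be wrapped in a small helper. This is only a sketch: the name visible_length is hypothetical, and the 'Me' (enclosing mark) category is an addition made here on the assumption that enclosing marks should not be counted either; the answer itself uses only 'Mc' and 'Mn'. It counts code points, so it does not account for double-width characters or for how a particular terminal or font renders the text.

```python
import unicodedata

# 'Mn' and 'Mc' come from the answer above; 'Me' is an assumption added
# for this sketch so that enclosing marks are skipped as well.
MARK_CATEGORIES = frozenset(['Mn', 'Mc', 'Me'])


def visible_length(text):
    """Count the code points in text whose general category is not a mark."""
    return sum(1 for ch in text
               if unicodedata.category(ch) not in MARK_CATEGORIES)


print(visible_length(u'A\u0332\u0305BC'))  # 3: the two combining lines are skipped
print(visible_length(u'e\u0301'))          # 1: 'e' plus a combining acute accent
print(visible_length(u'A\u20dd'))          # 1: 'A' plus a combining enclosing circle
```

The u'' prefix keeps the example valid on Python 2.7 and on Python 3.3 or later.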