How to remove just the accents, but not the umlauts, from strings in Python


I'm using the following code:

import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
              if unicodedata.category(c) != 'Mn')
strip_accents('ewaláièÜÖ')

which returns

'ewalaieUO'

But I want it to return

'ewalaieÜÖ'

Is there an easier way than replacing each character with str.replace(char_a, char_b)? How can I handle this efficiently?


Solution

  • So let's start with your test input:

    In [1]: test
    Out[1]: 'ewaláièÜÖ'
    

    Here is what happens to it under NFD normalization, which splits each accented letter into a base letter followed by a combining mark:

    In [2]: [x for x in unicodedata.normalize('NFD', test)]
    Out[2]: ['e', 'w', 'a', 'l', 'a', '́', 'i', 'e', '̀', 'U', '̈', 'O', '̈']
    

    And here are the unicodedata categories of each normalized element:

    In [3]: [unicodedata.category(x) for x in unicodedata.normalize('NFD', test)]
    Out[3]: ['Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Mn', 'Ll', 'Ll', 'Mn', 'Lu', 'Mn', 'Lu', 'Mn']
    

    As you can see, not only the accents but also the umlauts are in category Mn (nonspacing mark), so the category alone cannot tell them apart. Instead of unicodedata.category, you can use unicodedata.name:

    In [4]: [unicodedata.name(x) for x in unicodedata.normalize('NFD', test)]
    Out[4]: ['LATIN SMALL LETTER E',
     'LATIN SMALL LETTER W',
     'LATIN SMALL LETTER A',
     'LATIN SMALL LETTER L',
     'LATIN SMALL LETTER A',
     'COMBINING ACUTE ACCENT',
     'LATIN SMALL LETTER I',
     'LATIN SMALL LETTER E',
     'COMBINING GRAVE ACCENT',
     'LATIN CAPITAL LETTER U',
     'COMBINING DIAERESIS',
     'LATIN CAPITAL LETTER O',
     'COMBINING DIAERESIS']
    

    Here the accents are named COMBINING ACUTE ACCENT and COMBINING GRAVE ACCENT, while the umlauts are named COMBINING DIAERESIS. So here is my suggestion for fixing your code:

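    def strip_accents(s):
        # name(c, '') avoids a ValueError for unnamed characters such as '\n'
        stripped = ''.join(c for c in unicodedata.normalize('NFD', s)
                           if not unicodedata.name(c, '').endswith('ACCENT'))
        # Recompose, so that U + COMBINING DIAERESIS becomes a single 'Ü' again
        return unicodedata.normalize('NFC', stripped)

    strip_accents(test)
    'ewalaieÜÖ'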
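
One caveat with matching on the name suffix alone: some ordinary, non-combining characters also have names ending in ACCENT, for example the backtick (GRAVE ACCENT) and the caret (CIRCUMFLEX ACCENT), so they would be removed as well. The following is a minimal sketch of a stricter variant, not a definitive implementation. It only considers characters in category Mn and recomposes the result with NFC. The function name strip_accents_keep_umlauts is made up for this example.

```python
import unicodedata

def strip_accents_keep_umlauts(s):
    """Drop combining marks whose name ends in ACCENT; keep everything else."""
    decomposed = unicodedata.normalize('NFD', s)
    kept = ''.join(
        c for c in decomposed
        # Only combining marks (category Mn) are candidates for removal,
        # so '`' (GRAVE ACCENT, category Sk) is left alone.
        if not (unicodedata.category(c) == 'Mn'
                and unicodedata.name(c, '').endswith('ACCENT'))
    )
    # Recompose so that U + COMBINING DIAERESIS becomes the single code point Ü.
    return unicodedata.normalize('NFC', kept)

print(strip_accents_keep_umlauts('ewaláièÜÖ'))  # ewalaieÜÖ
print(strip_accents_keep_umlauts('`x` ^'))      # `x` ^
```

Because the result is recomposed, it compares equal to a normally typed 'ewalaieÜÖ', which a string left in decomposed form would not.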
    

    Also, as the unicodedata documentation explains, this module is just an interface to the Unicode Character Database (UCD), so please take a look at the list of names in that database to make sure the ACCENT suffix covers all the cases you need.
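
To check that coverage without reading the database by hand, one option is to list the names Python itself reports. The sketch below assumes that only the Combining Diacritical Marks block (U+0300 to U+036F) matters for your input, so widen the range if your text uses marks from other blocks. BLOCK and mark_names are names made up for this example.

```python
import unicodedata

# Combining Diacritical Marks block: U+0300 to U+036F.
BLOCK = range(0x0300, 0x0370)

def mark_names(suffix=''):
    """Return names of combining marks in BLOCK that end with `suffix`."""
    names = (unicodedata.name(chr(cp), '') for cp in BLOCK)
    return [n for n in names if n and n.endswith(suffix)]

stripped = mark_names('ACCENT')
kept = [n for n in mark_names() if n not in stripped]

print('Removed by the suffix test:')
for n in stripped:
    print('  ', n)
print('Left in place:', len(kept), 'marks, e.g.')
for n in kept[:5]:
    print('  ', n)
```

Marks such as COMBINING CARON and COMBINING TILDE do not end in ACCENT, so letters like č and ñ are left intact by the suffix test. Whether that is what you want depends on your data.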