Search code examples
pythonstringunicodeascii

Replace Enclosed Alphanumerics in Python


How can I replace characters such as 🅰, ①,Ⓐ to their 'normal' forms in python?

Note, however, I would still like accent character to remain the same (ç,á etc)

Hoping I don't need to make a dictionary.

More examples:

https://en.wikipedia.org/wiki/Enclosed_Alphanumerics

https://en.wikipedia.org/wiki/Enclosed_Alphanumeric_Supplement

http://xahlee.info/comp/unicode_circled_numbers.html

-=-=EDIT=--=-

For future people I used this, thanks AKX

(
         "".join(
             re.sub(r'([^\w])', '', unidecode.unidecode(ch)) #return only the word characters from the decode
                 if 
                         ord(ch) > 65535  #Emoticons
                     or (ord(ch) >= 9312 and ord(ch) <= 9472)       #circled
                     or (ord(ch) >= 65296 and ord(ch) <= 65305)     #latin numbers
                     or (ord(ch) >= 65313 and ord(ch) <= 65338)     #latin upper
                     or (ord(ch) >= 65345 and ord(ch) <= 9472)      #latin lower
                 else 
                     ch 
                 for ch in criteria
                   )
              )

Solution

  • There's no universal way, but happily someone else has made the dictionary for you: the unidecode library works for your particular characters.

    $ pip install unidecode
    Successfully installed unidecode-1.3.4
    $ python3
    >>> import unidecode
    >>> unidecode.unidecode('🅰, ①,Ⓐ')
    '[A], 1,A'
    >>>
    

    Unidecode will, however, also deburr accented characters, so you'll need a bit more to keep them.

    For example, assuming all the strange characters are not in the Unicode basic plane, you can filter the string character by character:

    >>> "".join(unidecode.unidecode(ch) if ord(ch) > 65535 else ch for ch in '🅰 Oh lá lá')
    '[A] Oh lá lá'