How can I replace characters such as 🅰, ①,Ⓐ to their 'normal' forms in python?
Note, however, I would still like accent character to remain the same (ç,á etc)
Hoping I don't need to make a dictionary.
More examples:
https://en.wikipedia.org/wiki/Enclosed_Alphanumerics
https://en.wikipedia.org/wiki/Enclosed_Alphanumeric_Supplement
http://xahlee.info/comp/unicode_circled_numbers.html
-=-=EDIT=--=-
For future people I used this, thanks AKX
(
"".join(
re.sub(r'([^\w])', '', unidecode.unidecode(ch)) #return only the word characters from the decode
if
ord(ch) > 65535 #Emoticons
or (ord(ch) >= 9312 and ord(ch) <= 9472) #circled
or (ord(ch) >= 65296 and ord(ch) <= 65305) #latin numbers
or (ord(ch) >= 65313 and ord(ch) <= 65338) #latin upper
or (ord(ch) >= 65345 and ord(ch) <= 9472) #latin lower
else
ch
for ch in criteria
)
)
There's no universal way, but happily someone else has made the dictionary for you: the unidecode
library works for your particular characters.
$ pip install unidecode
Successfully installed unidecode-1.3.4
$ python3
>>> import unidecode
>>> unidecode.unidecode('🅰, ①,Ⓐ')
'[A], 1,A'
>>>
Unidecode will, however, also deburr accented characters, so you'll need a bit more to keep them.
For example, assuming all the strange characters are not in the Unicode basic plane, you can filter the string character by character:
>>> "".join(unidecode.unidecode(ch) if ord(ch) > 65535 else ch for ch in '🅰 Oh lá lá')
'[A] Oh lá lá'