I am looking for an easy way to match all unicode characters that look like a given letter. Consider an example for a selection of characters that look like small letter n
.
import re
import sys
sys.stdout.reconfigure(encoding='utf-8')
look_alike_chars = [ 'n', 'ń', 'ņ', 'ň', 'ʼn', 'ŋ', 'ո', 'п', 'ո']
pattern = re.compile(r'[nńņňʼnŋոп]')
for char in look_alike_chars:
if pattern.match(char):
print(f"Character '{char}' matches the pattern.")
else:
print(f"Character '{char}' does NOT match the pattern.")
Instead of r'[nńņňʼnŋոп]'
I expect something like r'[\LookLike{n}]
where LookLike
is a token.
If this is not possible, do you know of a program or website that lists all the look-alike symbols for a given ASCII letter?
There is a Python library called Unidecode that will translate some of these special Unicode character to the corresponding ascii character. A code sample should be this:
import unidecode
input_string = "ñ"
clean_string = unidecode.unidecode(input_string) # = 'n'
However, the correspondences are made based on their actual idiom equivalent.
So whenever you want to translate a capital Greek 'PI' (written п), it will return a 'p' because pi is equivalent to a 'p' and not to an 'n'. Besides there is no publicly available database of such correspondences. Feel free to build one and share it!