python, string, unicode

Unicode look-alikes


I am looking for an easy way to match all Unicode characters that look like a given letter. As an example, consider a selection of characters that look like the lowercase letter n.

import re
import sys

sys.stdout.reconfigure(encoding='utf-8')

look_alike_chars = [ 'n', 'ń', 'ņ', 'ň', 'ʼn', 'ŋ', 'ո', 'п', 'ո']
pattern = re.compile(r'[nńņňʼnŋոп]')


for char in look_alike_chars:
    if pattern.match(char):
        print(f"Character '{char}' matches the pattern.")
    else:
        print(f"Character '{char}' does NOT match the pattern.")

Instead of r'[nńņňʼnŋոп]', I would like to write something like r'[\LookLike{n}]', where LookLike is a token.

If this is not possible, do you know of a program or website that lists all the look-alike symbols for a given ASCII letter?


Solution

  • There is a Python library called Unidecode that transliterates some of these special Unicode characters to the corresponding ASCII characters. For example:

    import unidecode
    
    input_string = "ñ"
    clean_string = unidecode.unidecode(input_string) # = 'n'
    

    However, Unidecode maps characters by transliteration (how the letter is read in its own script), not by visual appearance.

    So if you pass it the Cyrillic letter pe (written п), which looks like an 'n', it will return a 'p', because the letter is pronounced like a 'p' and not like an 'n'. Besides, I am not aware of a publicly available database of such visual correspondences. Feel free to build one and share it!
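
If you only need something close to the `\LookLike{n}` token from the question, here is a minimal sketch using only the standard library. It assumes that "looks like n" can be approximated by "normalizes (NFKD) to n plus optional combining marks", which covers accented and compatibility variants such as ń, ņ, ň and ñ. The helper names `accented_variants` and `look_like` are hypothetical, not part of `re` or Unidecode. Cross-script look-alikes such as ŋ, ո and п cannot be found this way, so the sketch takes them as a hand-written `extra` string.

```python
import re
import sys
import unicodedata


def accented_variants(base):
    """Return code points whose NFKD form is `base` plus only combining marks."""
    variants = []
    for code_point in range(sys.maxunicode + 1):
        char = chr(code_point)
        if char == base:
            continue
        decomposed = unicodedata.normalize("NFKD", char)
        # Keep the character only if it reduces to the base letter,
        # optionally followed by combining marks (accents, dots, cedillas).
        if decomposed[0] == base and all(
            unicodedata.combining(c) for c in decomposed[1:]
        ):
            variants.append(char)
    return variants


def look_like(base, extra=""):
    """Build a regex character class for `base`, its variants, and hand-picked extras."""
    chars = [base, *accented_variants(base), *extra]
    return "[" + "".join(re.escape(c) for c in chars) + "]"


# Assumption: the extras below are chosen by eye; normalization does not find them.
pattern = re.compile(look_like("n", extra="ŋոп"))

for char in ["n", "ń", "ņ", "ň", "ŋ", "ո", "п", "m"]:
    verdict = "matches" if pattern.fullmatch(char) else "does NOT match"
    print(f"Character '{char}' {verdict} the pattern.")
```

Scanning every code point takes a moment, so in real use you would probably build the pattern once and cache it. Whether compatibility forms such as the fullwidth or superscript n should count as look-alikes is a judgment call; this sketch includes them.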