python, python-3.x, unicode

How to compare Bengali homoglyph words or characters in Python?


s1='ফটিকছড়ি' #escape-unicode= %u09AB%u099F%u09BF%u0995%u099B%u09A1%u09BC%u09BF
s2='ফটিকছড়ি' #escape-unicode= %u09AB%u099F%u09BF%u0995%u099B%u09DC%u09BF

They look the same but compare as different. How can I treat them as the same string?


Solution

  • In Unicode, the character U+09DC (BENGALI LETTER RRA) is canonically equivalent to the sequence U+09A1 (BENGALI LETTER DDA) U+09BC (BENGALI SIGN NUKTA). When you compare Unicode strings, you should normalize them first so that canonically equivalent sequences fold together. Convert both strings to the same form, either Unicode normalization form C (NFC) or form D (NFD), before comparing.

    See UAX #15 Unicode Normalization Forms for details on Unicode normalization.

    See this answer for how to normalize Unicode strings in Python.
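
As a minimal sketch of the advice above, the standard-library `unicodedata.normalize` function can be applied to both strings before comparing. The helper name `same_text` is made up for illustration and is not from the original answer. The two strings are written with `\u` escapes so the code points from the question stay visible.

```python
import unicodedata

# The two strings from the question, spelled out as explicit code points.
s1 = '\u09AB\u099F\u09BF\u0995\u099B\u09A1\u09BC\u09BF'  # DDA + NUKTA (two code points)
s2 = '\u09AB\u099F\u09BF\u0995\u099B\u09DC\u09BF'        # precomposed RRA (one code point)

def same_text(a, b, form='NFC'):
    """Hypothetical helper: compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(s1 == s2)                  # raw comparison of the code point sequences
print(same_text(s1, s2))         # comparison after NFC normalization
print(same_text(s1, s2, 'NFD'))  # comparison after NFD normalization
```

The raw comparison should report the strings as different, while both normalized comparisons should report them as equal. Either form works for this purpose as long as both sides use the same one.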