python string encoding ascii data-processing

Python - Same letter but different font/ascii code/encoding?


I am trying to build a simple algorithm that takes some text and predicts some probabilities. While splitting the text into individual words, I realized that some words/characters have different character codes even though they should mean the same thing to me and my algorithm. For example, I found both 'ａｒｅ' and 'are'. The first one seems to use a different font or something like that. Their hash codes in Python are also different:

hash('ａｒｅ') #5179570038677318294
hash('are') #-5669913536749823475

If I check the code of an individual character:

ord('ａ') #65345
ord('a') #97

So, I was wondering if anybody knows exactly why this happens and, of course, whether there is a workaround. Ideally, I would like 'ａｒｅ' and 'are' to be treated as the same word so that I can group them. (The data is stored in a .csv file on a virtual machine.)

Thanks for the help!

EDIT:

In case this is useful, the best solution for me was:

from unidecode import unidecode
unidecode(text)

This also makes it easy to remove accents.
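
For anyone who cannot install unidecode, a rough standard-library approximation is sketched below. It is not what unidecode does internally (unidecode also transliterates non-Latin scripts); it only decomposes characters with NFKD and drops combining marks. The helper name strip_accents and the sample string are made up for illustration.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits accented letters into base letter + combining mark
    # and also folds compatibility forms such as fullwidth letters.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep only characters that are not combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

result = strip_accents('café ａｒｅ')  # hypothetical sample input
print(result)  # cafe are
```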


Solution

  • You can normalize the string using unicodedata.normalize:

    import unicodedata
    ord(unicodedata.normalize('NFKC', 'ａ')) # 97
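
    The same call can be applied to whole words before grouping them. The snippet below is only a sketch under assumptions: the word list is a made-up sample standing in for the .csv data, and normalize_word is a hypothetical helper name.

```python
import unicodedata
from collections import Counter

def normalize_word(word):
    # NFKC maps compatibility characters (e.g. fullwidth letters)
    # to their ordinary equivalents.
    return unicodedata.normalize('NFKC', word)

# Hypothetical sample; the first entry uses fullwidth letters.
words = ['ａｒｅ', 'are', 'you']

# Count words after normalization so both spellings land in one group.
counts = Counter(normalize_word(w) for w in words)
print(counts['are'])  # 2
```

    Normalizing once, right after reading the data, keeps later steps such as hashing, comparison, and grouping consistent.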