I am trying to make a simple algorithm that takes some text and predicts some probabilities. While processing the text as individual words, I realized that I have words/characters whose character codes differ even though their meaning, for me and my algorithm, should be the same. For example, I found 'ａｒｅ' and 'are'. The first one seems to use a different font or something like that. Their hash values in Python are also different:
hash('ａｒｅ') #5179570038677318294
hash('are') #-5669913536749823475
If I check the individual code points:
ord('ａ') #65345
ord('a') #97
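Checking what the first character actually is (unicodedata is in the standard library) suggests it is some kind of fullwidth form:
import unicodedata
unicodedata.name('ａ') # 'FULLWIDTH LATIN SMALL LETTER A'
unicodedata.name('a') # 'LATIN SMALL LETTER A'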
So, I was wondering if anybody knows why this is, exactly, and, of course, about any workaround. The ideal result would be to treat 'ａｒｅ' and 'are' as the same word so that I can group them. (The data is stored in a .csv file on a virtual machine.)
Thanks for the help!
EDIT:
In case this is useful, the best solution for me was:
from unidecode import unidecode
unidecode(text)
This also conveniently removes accents along the way.
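For example (a quick sketch; unidecode is a third-party package, installable with pip install Unidecode):
from unidecode import unidecode
unidecode('ａｒｅ') # 'are' -- fullwidth letters become plain ASCII
unidecode('café') # 'cafe' -- accents are stripped as well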
You can normalize the string using unicodedata.normalize:
import unicodedata
ord(unicodedata.normalize('NFKC', 'ａ')) # 97
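For grouping, here is a minimal sketch applying this to whole words (the sample list is just an illustration):
import unicodedata
words = ['ａｒｅ', 'are']
normalized = [unicodedata.normalize('NFKC', w) for w in words]
normalized # ['are', 'are'] -- both variants now compare and hash equal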