I have a list of tokens and need to find them in a text. I'm using pandas
to store my texts. However, I noticed that some of the tokens I am looking for are misspelled, so I am thinking about using the Levenshtein distance to pick up those misspelled tokens as well. At the moment, I have implemented a very simple approach:
df_texts['Text'].str.findall('|'.join(list_of_tokens))
That works perfectly fine. My question is: how do I add edit_distance
to also account for misspelled tokens? The NLTK
package offers a nice function to compute the edit distance:
from nltk.metrics import edit_distance
>>> edit_distance('trazodone', 'trazadon')
2
In the above example, trazodone
is the correct token, while trazadon
is the misspelled one that should still be retrieved from my text.
In theory, I could check every single word in my texts and measure its edit distance to every token to decide whether they are similar, but that would be very inefficient. Any Pythonic ideas?
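For illustration, the naive version I have in mind looks roughly like this (treating anything within an edit distance of 2 as a match, which is an arbitrary threshold):

from nltk.metrics import edit_distance

def naive_find(text, list_of_tokens, max_dist=2):
    # Compare every word of the text against every search token.
    matches = []
    for word in text.split():
        for token in list_of_tokens:
            if edit_distance(word, token) <= max_dist:
                matches.append(word)
                break
    return matches

df_texts['Text'].apply(lambda t: naive_find(t, list_of_tokens))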
I would start by using a "spell check" function to get a list of all words in the corpus which are not spelled correctly. This will cut down the data set massively. Then you can brute-force the misspelled words using edit_distance
against all the search tokens whose length is similar enough (say, within one or two characters of the same length).
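A rough sketch of that first step, using NLTK's words corpus as a crude stand-in for a proper spell-check library (any dictionary-based checker would do):

from nltk.corpus import words  # requires nltk.download('words')

english_vocab = set(w.lower() for w in words.words())

def misspelled_words(texts):
    # Collect every word in the corpus that is not in the English dictionary.
    misspelled = set()
    for text in texts:
        for word in text.lower().split():
            if word.isalpha() and word not in english_vocab:
                misspelled.add(word)
    return misspelled

misspelled = misspelled_words(df_texts['Text'])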
You can pre-compute a dict of the search tokens keyed by their length, so when you find a misspelled word like "portible" you can check its edit distance from all your search tokens having 7, 8, or 9 characters.
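A minimal sketch of that lookup; the ±1 length window and the distance threshold of 2 are just illustrative choices:

from collections import defaultdict
from nltk.metrics import edit_distance

# Pre-compute: search tokens grouped by length.
tokens_by_length = defaultdict(list)
for token in list_of_tokens:
    tokens_by_length[len(token)].append(token)

def closest_token(word, max_dist=2):
    # For "portible" (8 letters), only tokens of length 7, 8 or 9 are compared.
    candidates = []
    for length in (len(word) - 1, len(word), len(word) + 1):
        candidates.extend(tokens_by_length[length])
    scored = [(edit_distance(word, t), t) for t in candidates]
    good = [(d, t) for d, t in scored if d <= max_dist]
    return min(good)[1] if good else None

# e.g. closest_token('trazadon') returns 'trazodone'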