
edit_distance with 'findall' in pandas


I have a list of tokens and need to find them in a text, which I store in pandas. However, I noticed that some of the tokens I am looking for are misspelled, so I am thinking about using the Levenshtein distance to pick up those misspelled tokens as well. At the moment, I have implemented a very simple approach:

df_texts['Text'].str.findall('|'.join(list_of_tokens))

That works perfectly fine. My question is how to add edit_distance to the lookup to account for the misspelled tokens. The NLTK package offers a nice function to compute the edit distance:

from nltk.metrics import edit_distance

>>> edit_distance('trazodone', 'trazadon')
2

In the above example, trazodone is the correct token, while trazadon is the misspelled one that should still be retrieved from my text.

In theory, I could check every single word in my texts and measure its edit distance to each token to decide whether they are similar, but that would be very inefficient. Any Pythonic ideas?


Solution

  • I would start by using a "spell check" function to get a list of all words in the corpus which are not spelled correctly. This will cut down the data set massively. Then you can brute-force the misspelled words using edit_distance against all the search tokens whose length is similar enough (say, within one or two characters of the same length).

    You can pre-compute a dict of the search tokens keyed by their length, so when you find a misspelled word like "portible" you can check its edit distance from all your search tokens having 7, 8, or 9 characters (see the sketch below).
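
    A minimal sketch of that two-step approach, assuming the third-party pyspellchecker package for the spell-check step (its SpellChecker.unknown method returns the words missing from its dictionary); the token list, the sample DataFrame, and the fuzzy_find helper are placeholders for illustration:

    import re
    from collections import defaultdict

    import pandas as pd
    from nltk.metrics import edit_distance
    from spellchecker import SpellChecker  # pip install pyspellchecker

    list_of_tokens = ['trazodone', 'portable']  # the tokens to search for
    df_texts = pd.DataFrame({'Text': ['He took trazadon with a portible fan.']})

    # Pre-compute a dict of the search tokens keyed by their length.
    tokens_by_length = defaultdict(list)
    for token in list_of_tokens:
        tokens_by_length[len(token)].append(token)

    spell = SpellChecker()

    def fuzzy_find(text, max_distance=2):
        """Return every search token matched by a (possibly misspelled) word."""
        words = re.findall(r'[a-z]+', text.lower())
        misspelled = spell.unknown(words)  # words the dictionary doesn't know
        matches = set()
        for word in misspelled:
            # Only compare against tokens within two characters of the same length.
            for length in range(len(word) - 2, len(word) + 3):
                for token in tokens_by_length.get(length, []):
                    if edit_distance(word, token) <= max_distance:
                        matches.add(token)
        return sorted(matches)

    df_texts['Matches'] = df_texts['Text'].apply(fuzzy_find)
    print(df_texts)
    #                                     Text                 Matches
    # 0  He took trazadon with a portible fan.  [portable, trazodone]

    Note that a correctly spelled but rare token such as trazodone will also be flagged as unknown by a general-purpose dictionary, but it still matches itself with an edit distance of 0, so nothing is lost; you can tune max_distance to trade recall against precision.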