Tags: python, list-comprehension, string-comparison, spell-checking, levenshtein-distance

Compare strings in dataframe column according to levenshtein distance with words in a list


After using PyTesseract, I have a dataframe of words (they are in Greek, but that does not matter). I have also created a list (words_list) that serves as my custom dictionary, containing words specific to the topic I am examining.

What I want to do is compare every word in df2["no_punctuation"] with every word in the list and:

  • If the Levenshtein distance between the pair of words is lower than 4, replace the word in the dataframe with the corresponding word from the list
  • Otherwise, leave the cell empty

Essentially, this is a step in my own spellchecker, but I have not been able to make it work so far.

Images of the dataframe and the list are attached.

[Images: dataframe, list]

This is what I have tried so far:

for j in range (0,len(df2)):
    for word in words_list:
        if (enchant.utils.levenshtein(df2["no_punctuation"][j],word)<4):
            df2["new"][j]=word
        else:
            df2["new"][j]=""

As the dataframe image shows, it corrects only the word "generali" and leaves all the other cells empty. However, many other cells should be filled in too.

I have also tried the following, but it produces only empty cells.

df2['new']=df2["no_punctuation"].apply(lambda x:"" if (enchant.utils.levenshtein(text,word)>=4 for word in words_list) else word )

I think I am close, but something is still wrong. Any ideas?


Solution

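  • The cells end up empty because of the else branch. The inner loop keeps running after a match, so any later word whose Levenshtein distance is 4 or more overwrites that match with an empty string; in effect, only the comparison with the last word in the list survives. Removing the else branch should fix this.

    Also, create the new column once, before the loop, so that rows without any match stay empty.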

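    import enchant.utils

    df2["new"] = ""  # create the column once, before the loop
    for j in range(len(df2)):
        for word in words_list:
            # write only on a match; a later non-match no longer clears the cell
            if enchant.utils.levenshtein(df2["no_punctuation"][j], word) < 4:
                df2.loc[j, "new"] = word  # .loc avoids chained assignment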
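
As a possible alternative to the nested loop, the matching rule can be moved into a small helper and applied per cell. The sketch below is only an illustration: `levenshtein` here is a plain-Python stand-in for `enchant.utils.levenshtein`, and `closest_word` and its `max_distance` parameter are hypothetical names, not part of any library. Unlike the loop above, which keeps the last matching word, this version keeps the closest one.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_word(word, candidates, max_distance=4):
    """Return the nearest candidate with distance < max_distance, else ""."""
    best, best_distance = "", max_distance
    for candidate in candidates:
        distance = levenshtein(word, candidate)
        if distance < best_distance:  # strictly closer than anything so far
            best, best_distance = candidate, distance
    return best


# Assumed usage with the dataframe from the question:
# df2["new"] = df2["no_punctuation"].apply(lambda w: closest_word(w, words_list))
```

This also explains why the apply attempt in the question yields only empty cells: a generator expression is always truthy, so the conditional always takes the "" branch regardless of the distances.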