After running PyTesseract I have a dataframe of words (they are in Greek, but that does not matter). I have also created a list (words_list) that serves as my custom dictionary, containing words specific to the topic I am examining.
What I want to do is compare every word in df["no_punctuation"] with every word in the list and, when the Levenshtein distance is below 4, write the matching dictionary word to a new column.
Essentially, it is a step towards my own spellchecker; however, I cannot make it work so far.
An image of the dataframe and the list are attached.
What I have tried so far is this:

    for j in range(0, len(df2)):
        for word in words_list:
            if (enchant.utils.levenshtein(df2["no_punctuation"][j], word) < 4):
                df2["new"][j] = word
            else:
                df2["new"][j] = ""

And, as shown in the dataframe image, it corrects only the word "generali" and leaves all the other cells empty. However, there are many other cells that should be filled in too.
I have also tried the following; however, it produces only empty cells.

    df2['new'] = df2["no_punctuation"].apply(lambda x: "" if (enchant.utils.levenshtein(text, word) >= 4 for word in words_list) else word)

I think I am close, but something is still wrong. Any ideas?
The empty cells come from the else branch. The inner loop keeps running after a match, so every later word whose Levenshtein distance is 4 or more overwrites the cell with an empty string, and only the result for the last word in words_list survives. Removing the else branch should solve this.
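To make the overwrite visible, here is a minimal sketch in plain Python, without pandas or enchant. The `toy_distance` helper and the sample words are hypothetical stand-ins chosen only for illustration; they are not the data from the question.

```python
def toy_distance(a, b):
    # Hypothetical stand-in for enchant.utils.levenshtein: counts differing
    # positions plus the length difference, which is enough for this demo.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


words_list = ["generali", "insurance", "policy"]  # hypothetical dictionary
token = "generalj"                                # hypothetical OCR output

# Loop WITH the else branch: each non-matching word resets the result.
with_else = None
for word in words_list:
    if toy_distance(token, word) < 4:
        with_else = word
    else:
        with_else = ""

# Same loop WITHOUT the else branch: a match is never erased.
without_else = ""
for word in words_list:
    if toy_distance(token, word) < 4:
        without_else = word

print(repr(with_else), repr(without_else))
```

With the else branch the match on "generali" is found first and then erased by the two words that follow it in the list.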
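Also, create the new column once, before the loop:

    import enchant.utils

    df2["new"] = ""  # initialise every cell as empty before looping
    for j in range(len(df2)):
        for word in words_list:
            if enchant.utils.levenshtein(df2["no_punctuation"][j], word) < 4:
                # .loc avoids chained assignment (df2["new"][j] = ...)
                df2.loc[j, "new"] = word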
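
Two further points may be worth a sketch. First, the loop above keeps the last word in words_list that is within the threshold, which is not necessarily the closest one. Second, the `apply` attempt in the question tests a generator expression directly; a generator object is always truthy, so the lambda returns "" for every row without ever computing a distance. The version below picks the nearest dictionary word instead. The names `levenshtein` and `closest_word` and the `max_distance` parameter are my own, and the pure-Python distance function is only a stand-in so the example runs without pyenchant.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (pure-Python stand-in)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def closest_word(token, words_list, max_distance=3):
    """Return the nearest dictionary word, or "" if none is close enough."""
    if not words_list:
        return ""
    best = min(words_list, key=lambda w: levenshtein(token, w))
    return best if levenshtein(token, best) <= max_distance else ""
```

Assuming every cell in the column is a string, this could be applied with `df2["new"] = df2["no_punctuation"].apply(lambda w: closest_word(w, words_list))`. A `max_distance` of 3 corresponds to the `< 4` test used above, and `enchant.utils.levenshtein` can be swapped in for the stand-in if pyenchant is installed.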