I have some data containing spelling errors. For example:
# Define the correct spellings:
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
# Define the data that contains spelling errors:
B = {'one': pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
     'two': pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}
df_B = pd.DataFrame(B)
I'm trying to correct them using the following code:
import pandas as pd
import difflib
# Define the function that corrects the spelling:
def Spelling(ask):
    difflib.get_close_matches(ask, Li_A, n=1, cutoff=0.5)
# Apply the function that corrects the spelling:
for index,row in df_B.iterrows():
    df_B.loc[index,'Correct one'] = Spelling(df_B['one'])
for index,row in df_B.iterrows():
    df_B.loc[index,'Correct two'] = Spelling(df_B['two'])
df_B
But all that I get out is:
      one     two  Correct one  Correct two
a  potat0  po1ato          NaN          NaN
b  toma3o  2omato          NaN          NaN
c  s5uash  squ0sh          NaN          NaN
d   ap8le   2pple          NaN          NaN
e    pea7    p3ar          NaN          NaN
How do I get the correct spellings added as new columns on my dataframe, where it currently says "NaN", please?
It does work when I run it on one word at a time:
import difflib
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
B = 'potat0'
C = difflib.get_close_matches(B, Li_A, n=1, cutoff=0.5)
C
Out: ['potato']
You forgot the return in the function. Also, inside the iterrows loop, use row to select the value for each iteration, and you only need to call iterrows once:
def Spelling(ask):
    return difflib.get_close_matches(ask, Li_A, n=1, cutoff=0.5)
# Apply the function that corrects the spelling:
for index,row in df_B.iterrows():
    df_B.loc[index,'Correct one'] = Spelling(row['one'])
    df_B.loc[index,'Correct two'] = Spelling(row['two'])
print(df_B)
      one     two  Correct one  Correct two
a  potat0  po1ato     [potato]     [potato]
b  toma3o  2omato     [tomato]     [tomato]
c  s5uash  squ0sh     [squash]     [squash]
d   ap8le   2pple      [apple]      [apple]
e    pea7    p3ar       [pear]       [pear]
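To see why the missing return matters, here is a small sketch comparing the two versions of the function on a single word. The names spelling_without_return and spelling_with_return are only illustrative and are not part of the original code. A Python function with no return statement returns None, which fits the NaN columns shown in the question.

```python
import difflib

Li_A = ["potato", "tomato", "squash", "apple", "pear"]

# Illustrative name: mirrors the original function, which has no return statement.
def spelling_without_return(ask):
    difflib.get_close_matches(ask, Li_A, n=1, cutoff=0.5)

# Illustrative name: the same function with the return added.
def spelling_with_return(ask):
    return difflib.get_close_matches(ask, Li_A, n=1, cutoff=0.5)

result_missing = spelling_without_return("potat0")
result_fixed = spelling_with_return("potat0")

print(result_missing)  # None
print(result_fixed)    # ['potato']
```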
But it is simpler to use applymap (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map):
df_B[['Correct one','Correct two']] = df_B[['one','two']].applymap(Spelling)
print (df_B)
      one     two  Correct one  Correct two
a  potat0  po1ato     [potato]     [potato]
b  toma3o  2omato     [tomato]     [tomato]
c  s5uash  squ0sh     [squash]     [squash]
d   ap8le   2pple      [apple]      [apple]
e    pea7    p3ar       [pear]       [pear]
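Note that get_close_matches returns a list, so the new columns above hold one-element lists such as [potato]. If you want plain strings instead, one possible variation is sketched below. The helper name spelling_or_original is hypothetical, and falling back to the original word when nothing passes the cutoff is an assumption on my part, not something the question asked for.

```python
import difflib
import pandas as pd

Li_A = ["potato", "tomato", "squash", "apple", "pear"]

# Hypothetical helper: return the best match as a string,
# or the original word if nothing reaches the cutoff (assumed fallback).
def spelling_or_original(ask):
    matches = difflib.get_close_matches(ask, Li_A, n=1, cutoff=0.5)
    return matches[0] if matches else ask

df_B = pd.DataFrame(
    {
        "one": ["potat0", "toma3o", "s5uash", "ap8le", "pea7"],
        "two": ["po1ato", "2omato", "squ0sh", "2pple", "p3ar"],
    },
    index=["a", "b", "c", "d", "e"],
)

# Apply the helper to each source column, one Series at a time.
for col in ["one", "two"]:
    df_B["Correct " + col] = df_B[col].apply(spelling_or_original)

print(df_B)
```

Series.apply is used here rather than applymap so the same code runs on both older and newer pandas versions.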