Suppose I have a dataframe df1:
Sr  A      B                          C
1   rains  It rain there.             It rains there
2   plane  This is a vertical planes  This is a vertical plane
3   tree   Plant a trees              Plant a tree
Column C is my expected output. I need to compare each word in the strings of column B with the word in column A and replace it if the Levenshtein distance is 1.
My approach:
import jellyfish as jf

def word_replace(str1):
    comp = #don't know how to store value of column A in this variable.
    for word in str1.split():
        if jf.levenshtein_distance(word,comp) == 1:
            word = comp
        else:
            pass
    return str1

df1['C'] = df1['B'].apply(word_replace)
Second thing: what if column A has double words like "near miss"? How will I need to modify the above code? E.g.:
Sr  A          B                   C
1   near miss  that was a ner mis  that was a near miss
You have asked two questions in one, which is never a good idea on Stack Overflow. I'm only going to answer your first question; if you want someone to look at your second problem, I suggest you write a new question specifically for it.
pd.DataFrame.apply can work either across rows or across columns. You want to work on each row individually, so you must pass the axis=1 keyword argument.
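As a small illustration of what axis=1 changes, the sketch below shows what the applied function receives in each case. The two-row frame and the variable names are made up for the example.

```python
import pandas as pd

# Hypothetical two-row frame, just for illustration
df = pd.DataFrame({'A': ['rains', 'plane'],
                   'B': ['It rain there', 'This is a vertical planes']})

# axis=1: the function is called once per row and receives that row as a Series
row_types = df.apply(lambda row: type(row).__name__, axis=1)

# The row Series is indexed by column name, so both A and B are reachable
combined = df.apply(lambda row: row['A'] + ' | ' + row['B'], axis=1)

# Series.apply on a single column passes only that column's individual values
value_types = df['B'].apply(lambda value: type(value).__name__)

print(list(row_types))
print(list(value_types))
print(combined.iloc[0])
```

This is why a function applied to df1['B'] alone cannot see column A, while a function applied with axis=1 can read both.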
The code below solves your problem. It uses a list comprehension with a conditional expression (Python's ternary operator) to choose which words need replacing, then joins the resulting list back into a string with str.join().
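Your original code iterated over the split words and reassigned the loop variable, but that does not work: rebinding word inside the loop changes neither the list nor str1, so the function returns its input unchanged. It also called apply on df1['B'] alone, so the function received only the string from column B and had no way to see column A. With df1.apply(..., axis=1) the function instead receives each row as a pandas.Series object, which gives it access to both columns.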
This is a simplified piece of code and does not take into account things like punctuation; I leave that as an exercise for the reader.
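import pandas as pd
import jellyfish as jf

data1 = {'A': ['rains', 'plane', 'tree'],
         'B': ['It rain there', 'This is a vertical planes', 'Plant a trees']}
df1 = pd.DataFrame(data1)

def word_replace(row):
    # row is a pandas.Series holding one row of df1
    comp = row['A']
    str1 = row['B']
    # Swap in comp for every word that is exactly one edit away from it
    out = ' '.join([comp if jf.levenshtein_distance(word, comp) == 1
                    else word for word in str1.split()])
    return out

# axis=1 applies word_replace to each row rather than to each column
df1['C'] = df1.apply(word_replace, axis=1)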
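For the punctuation case mentioned above, one possible approach is to strip punctuation from the edges of each token before comparing, then put it back afterwards. The following is only a sketch under stated assumptions: the helper names replace_word and word_replace_text are hypothetical, and the small levenshtein function is a plain dynamic-programming stand-in for jf.levenshtein_distance so that the block runs without third-party packages. Matching stays case-sensitive, as in the code above.

```python
import string

def levenshtein(a, b):
    """Plain edit distance; a stand-in for jf.levenshtein_distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def replace_word(token, comp):
    """Replace the core of token with comp if it is one edit away,
    keeping any leading or trailing punctuation (hypothetical helper)."""
    core = token.strip(string.punctuation)
    if not core or levenshtein(core, comp) != 1:
        return token
    start = token.find(core)
    return token[:start] + comp + token[start + len(core):]

def word_replace_text(comp, text):
    """Apply replace_word to every whitespace-separated token."""
    return ' '.join(replace_word(tok, comp) for tok in text.split())

print(word_replace_text('rains', 'It rain there.'))
```

To use this per row, word_replace could return word_replace_text(row['A'], row['B']), with levenshtein swapped for jf.levenshtein_distance if jellyfish is installed.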