Tags: python, function, pandas, dataframe, levenshtein-distance

Replace word w.r.t word in another column using Levenshtein distance


Suppose I have a dataframe df1:

Sr       A              B                            C
1      rains         It rain there.             It rains there
2      plane         This is a vertical planes  This is a vertical plane
3      tree          Plant a trees              Plant a tree

Column C is my expected output. I need to compare each word in the strings of column B with the word in column A and replace it if the Levenshtein distance is 1.

My approach:

import jellyfish as jf
def word_replace(str1):
    comp = #don't know how to store value of column A in this variable.
    for word in str1.split():
        if jf.levenshtein_distance(word,comp) == 1:
           word = comp
        else:
            pass
    return str1

df1['C'] = df1['B'].apply(word_replace)

Second, what if column A contains multi-word entries like "near miss"? How would I need to modify the above code? E.g.:

 Sr       A              B                            C
  1     near miss        that was a ner mis          that was a near miss

Solution

  • You have asked two questions in one, which is never a good idea on Stack Overflow. I'm only going to answer your first question; if you want someone to look at the second problem, I suggest you post a new question specifically for it.

    pd.DataFrame.apply can work either across rows or across columns. You want to process each row individually, so you must pass the axis=1 keyword argument.

    The code below solves your problem. It uses a list comprehension with a conditional expression (Python's ternary operator) to choose which words to replace, then joins the resulting list back into a string with str.join().

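    Your original code iterated over the split words, but that cannot work: assigning to the loop variable word only rebinds that name. It does not change the list or str1, so the function returns str1 unchanged. It also had no access to column A, because df1['B'].apply passes the function only the string from column B. With df1.apply(..., axis=1), the function instead receives the whole row as a pandas.Series, from which both columns can be read.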
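To illustrate why the loop in the question has no effect, here is a minimal sketch in plain Python. The function names replace_by_rebinding and replace_by_rebuilding are hypothetical, and a simple equality check stands in for the Levenshtein comparison so the example needs no third-party packages.

```python
# Rebinding the loop variable leaves the original string untouched.
def replace_by_rebinding(sentence, comp):
    for word in sentence.split():
        if word == "rain":      # stand-in for the distance check
            word = comp         # only rebinds the local name `word`
    return sentence             # the input comes back unchanged


# Building a new list of words and joining it produces the edited string.
def replace_by_rebuilding(sentence, comp):
    words = [comp if word == "rain" else word for word in sentence.split()]
    return " ".join(words)


print(replace_by_rebinding("It rain there", "rains"))   # It rain there
print(replace_by_rebuilding("It rain there", "rains"))  # It rains there
```

The second function follows the same build-and-join pattern as the answer's word_replace.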

    This is a simplified piece of code and does not take into account things like punctuation; I leave that as an exercise for the reader.

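    import pandas as pd
    import jellyfish as jf
    
    # Sample data: A holds the target word, B the sentence to correct.
    data1 = {'A': ['rains', 'plane', 'tree'],
             'B': ['It rain there', 'This is a vertical planes', 'Plant a trees']}
    df1 = pd.DataFrame(data1)
    
    def word_replace(row):
        # With axis=1, each call receives one row as a pandas.Series.
        comp = row['A']
        str1 = row['B']
    
        # Swap in comp for every word exactly one edit away; keep the rest.
        out = ' '.join([comp if jf.levenshtein_distance(word, comp) == 1
                        else word for word in str1.split()])
        return out
    
    df1['C'] = df1.apply(word_replace, axis=1)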
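
One possible way to approach the punctuation case mentioned above is sketched below. This is not part of the original answer: the helper names levenshtein and replace_close_words are hypothetical, the pure-Python edit distance is only a stand-in for jf.levenshtein_distance so the sketch runs without jellyfish, and the choice of the regular expression \w+ to define a "word" is an assumption (it would split contractions such as "don't").

```python
import re


def levenshtein(a, b):
    """Plain dynamic-programming edit distance (stand-in for jf.levenshtein_distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def replace_close_words(text, comp, dist=1):
    """Replace each run of word characters in `text` that is exactly `dist` edits from `comp`."""
    def swap(match):
        word = match.group(0)
        return comp if levenshtein(word, comp) == dist else word

    # \w+ matches word characters only, so punctuation and spacing are preserved.
    return re.sub(r"\w+", swap, text)


print(replace_close_words("It rain there.", "rains"))  # It rains there.
```

Assuming the same df1 as above, this could be applied row by row with df1.apply(lambda row: replace_close_words(row['B'], row['A']), axis=1).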