Tags: python, pandas, dataframe, levenshtein-distance

How do I calculate the Levenshtein ratio/distance between the rows of a column in Python?


I have a dataframe with a single column and 1000 rows. I need to compare every row with every other row and find the Levenshtein distance for each pair. How do I calculate that ratio or distance in Python?

My dataframe looks like this:

  #Df 
  StepDescription
  click confirm button when done
  you have logged on
  please log in to proceed
  click on confirm button
  Dolb was released successfully
  Enter your details
  validate the statement
  Aval was released sucessfully

How do I calculate the Levenshtein ratio for all of these?

This is the code I have written to iterate over the rows, but I don't know how to proceed after that.

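  import Levenshtein
  import pandas as pd
  # raw string so the backslash in the Windows path is not read as an escape
  df = pd.read_csv(r'path\Data_TestDescription.csv')  # read_csv already returns a DataFrame
  for index, row in df.iterrows():
      pass  # stuck here: how do I compare this row with every other row?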

Solution

  • As requested in a comment, a percentage is wanted, so I'll keep the accepted answer and add just the new part:

    import numpy as np
    import pandas as pd
    from Levenshtein import distance
    from itertools import product
    
    #df = ...
    
    dist = [distance(*x) for x in product(df.StepDescription, repeat=2)]
    
    dist_df = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]))
    dist_df
    
        0   1   2   3   4   5   6   7
    0   0  23  23  13  29  25  25  28
    1  23   0  18  18  23  18  18  23
    2  23  18   0  20  25  21  19  24
    3  13  18  20   0  27  19  21  26
    4  29  23  25  27   0  26  23   5
    5  25  18  21  19  26   0  19  25
    6  25  18  19  21  23  19   0  21
    7  28  23  24  26   5  25  21   0
    
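    # Express each distance relative to the smallest non-zero distance (which becomes 100).
    # Multiply before the floor division: dist_df // 5 * 100 would turn 23 into 400, not 460.
    dist_df_percentage = dist_df * 100 // min(x for x in dist if x > 0)
    dist_df_percentage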
    
         0    1    2    3    4    5    6    7
    0    0  460  460  260  580  500  500  560
    1  460    0  360  360  460  360  360  460
    2  460  360    0  400  500  420  380  480
    3  260  360  400    0  540  380  420  520
    4  580  460  500  540    0  520  460  100
    5  500  360  420  380  520    0  380  500
    6  500  360  380  420  460  380    0  420
    7  560  460  480  520  100  500  420    0
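
The second table expresses each distance relative to the smallest non-zero distance, so its values are not bounded by 100. If what you want is a similarity between 0 and 100 for each pair, one possible approach is to normalize the distance by the length of the longer string. The sketch below is only an illustration under that assumption: `levenshtein` and `similarity_percent` are hypothetical helper names written in plain Python so that it runs without the `Levenshtein` package, and the normalization may not agree with what the library's own `ratio` function reports.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalization: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


# Same rows as the StepDescription column in the question (typo in the last row kept).
steps = [
    "click confirm button when done",
    "you have logged on",
    "please log in to proceed",
    "click on confirm button",
    "Dolb was released successfully",
    "Enter your details",
    "validate the statement",
    "Aval was released sucessfully",
]

n = len(steps)
matrix = [[100.0] * n for _ in range(n)]
# The distance is symmetric, so each unordered pair is computed only once.
for i, j in combinations(range(n), 2):
    matrix[i][j] = matrix[j][i] = round(similarity_percent(steps[i], steps[j]), 1)

for row in matrix:
    print(row)
```

With 1000 rows this pure-Python loop will be noticeably slower than the C-backed `distance` function used above, so in practice you would probably keep `distance` for the pairwise step and apply only the normalization idea.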