Search code examples
pythonpandasdataframelevenshtein-distancefuzzy-search

How can I calculate the Levenshtein distance between all rows in two dataframes and output the Levenshtein score for each pair?


I'm trying to calculate the Levenshtein distance between two dataframes (dfa & dfb) as set out below.

dfa:

Name      Addresss     ID  
Name1a    Address1a    ID1a
Name2a    Address2a    ID2a

dfb:

Name      Addresss      ID  
Name1b    Address1b   ID1b
Name2b    Address2b   ID2b

I understand computing the distance between two strings but I am a bit confused as to how I could do a set of columns against another with the output looking something like this with it displaying all pairs and scores:

Output:

Name      Name      LevScore
Name1a    Name1b       0.87
Name1a    Name2b       0.45
Name1a    Name3b       0.26
Name2a    Name1b       0.92
Name2a    Name2b       0.67
Name2a    Name3b       0.56
etc

Thanks in advance!

Manesh


Solution

  • You can use the package Levenshtein together with itertools to get the combinations of values for the two columns :

    import Levenshtein as lev
    from itertools import product
    
    new_df = pd.DataFrame(product(df1['Name'], df2['Name']), columns=["Name1","Name2"])
    
    new_df["LevScore"] = new_df.apply(lambda x: lev.score(x[0],x[1]), axis=1)
    
    print(new_df)
    
        Name1   Name2   LevScore
    0   Name1a  Name1b  1
    1   Name1a  Name2b  2
    2   Name2a  Name1b  2
    3   Name2a  Name2b  1
    

    EDIT

    Let's say this is your df1:

    df1_n = pd.concat([df1,df1,df1]).reset_index(drop=True)
    df1_n
    
    Name    Addresss    ID
    0   Name1a  Address1a   ID1a
    1   Name2a  Address2a   ID2a
    2   Name1a  Address1a   ID1a
    3   Name2a  Address2a   ID2a
    4   Name1a  Address1a   ID1a
    5   Name2a  Address2a   ID2a
    

    As you said you can compute the combinations of values taking chunks of size step from df1_n:

    fina_df = pd.DataFrame()
    step=2
    for i in range(0,df1_n.shape[0],step):
        new_df = pd.DataFrame(product(df1_n.iloc[i:i+step,0], df2['Name']), columns=["Name1","Name2"])
        new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x[0],x[1]), axis=1)
        fina_df = pd.concat([fina_df, new_df], axis=0).reset_index(drop=True)
    
    print(final_df)
    

    Output:

    Name1   Name2   LevScore
    0   Name1a  Name1b  1
    1   Name1a  Name2b  2
    2   Name2a  Name1b  2
    3   Name2a  Name2b  1
    4   Name1a  Name1b  1
    5   Name1a  Name2b  2
    6   Name2a  Name1b  2
    7   Name2a  Name2b  1
    8   Name1a  Name1b  1
    9   Name1a  Name2b  2
    10  Name2a  Name1b  2
    11  Name2a  Name2b  1
    

    For your situation change 2 to 300 or 500. This should avoid to fill up your entire RAM, let me know if it works!