python pandas dataframe levenshtein-distance fuzzy-search

How can I calculate the Levenshtein distance between all rows in two dataframes and output the Levenshtein score for each pair?

I'm trying to calculate the Levenshtein distance between two dataframes (dfa & dfb) as set out below.

dfa:

Name      Addresss     ID  
Name1a    Address1a    ID1a
Name2a    Address2a    ID2a

dfb:

Name      Addresss      ID  
Name1b    Address1b   ID1b
Name2b    Address2b   ID2b

I understand computing the distance between two strings but I am a bit confused as to how I could do a set of columns against another with the output looking something like this with it displaying all pairs and scores:

Output:

Name      Name      LevScore
Name1a    Name1b       0.87
Name1a    Name2b       0.45
Name1a    Name3b       0.26
Name2a    Name1b       0.92
Name2a    Name2b       0.67
Name2a    Name3b       0.56
etc

Thanks in advance!

Manesh

Solution

You can use the package Levenshtein together with itertools to get the combinations of values for the two columns :

import Levenshtein as lev
from itertools import product

new_df = pd.DataFrame(product(df1['Name'], df2['Name']), columns=["Name1","Name2"])

new_df["LevScore"] = new_df.apply(lambda x: lev.score(x[0],x[1]), axis=1)

print(new_df)

    Name1   Name2   LevScore
0   Name1a  Name1b  1
1   Name1a  Name2b  2
2   Name2a  Name1b  2
3   Name2a  Name2b  1

EDIT

Let's say this is your df1:

df1_n = pd.concat([df1,df1,df1]).reset_index(drop=True)
df1_n

Name    Addresss    ID
0   Name1a  Address1a   ID1a
1   Name2a  Address2a   ID2a
2   Name1a  Address1a   ID1a
3   Name2a  Address2a   ID2a
4   Name1a  Address1a   ID1a
5   Name2a  Address2a   ID2a

As you said you can compute the combinations of values taking chunks of size step from df1_n:

fina_df = pd.DataFrame()
step=2
for i in range(0,df1_n.shape[0],step):
    new_df = pd.DataFrame(product(df1_n.iloc[i:i+step,0], df2['Name']), columns=["Name1","Name2"])
    new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x[0],x[1]), axis=1)
    fina_df = pd.concat([fina_df, new_df], axis=0).reset_index(drop=True)

print(final_df)

Output:

Name1   Name2   LevScore
0   Name1a  Name1b  1
1   Name1a  Name2b  2
2   Name2a  Name1b  2
3   Name2a  Name2b  1
4   Name1a  Name1b  1
5   Name1a  Name2b  2
6   Name2a  Name1b  2
7   Name2a  Name2b  1
8   Name1a  Name1b  1
9   Name1a  Name2b  2
10  Name2a  Name1b  2
11  Name2a  Name2b  1

For your situation change 2 to 300 or 500. This should avoid to fill up your entire RAM, let me know if it works!