I'm trying to calculate the Levenshtein distance between two dataframes (dfa & dfb) as set out below.
dfa:
Name Addresss ID
Name1a Address1a ID1a
Name2a Address2a ID2a
dfb:
Name Addresss ID
Name1b Address1b ID1b
Name2b Address2b ID2b
I understand computing the distance between two strings but I am a bit confused as to how I could do a set of columns against another with the output looking something like this with it displaying all pairs and scores:
Output:
Name Name LevScore
Name1a Name1b 0.87
Name1a Name2b 0.45
Name1a Name3b 0.26
Name2a Name1b 0.92
Name2a Name2b 0.67
Name2a Name3b 0.56
etc
Thanks in advance!
Manesh
You can use the package Levenshtein
together with itertools
to get the combinations of values for the two columns :
import Levenshtein as lev
from itertools import product
new_df = pd.DataFrame(product(df1['Name'], df2['Name']), columns=["Name1","Name2"])
new_df["LevScore"] = new_df.apply(lambda x: lev.score(x[0],x[1]), axis=1)
print(new_df)
Name1 Name2 LevScore
0 Name1a Name1b 1
1 Name1a Name2b 2
2 Name2a Name1b 2
3 Name2a Name2b 1
EDIT
Let's say this is your df1:
df1_n = pd.concat([df1,df1,df1]).reset_index(drop=True)
df1_n
Name Addresss ID
0 Name1a Address1a ID1a
1 Name2a Address2a ID2a
2 Name1a Address1a ID1a
3 Name2a Address2a ID2a
4 Name1a Address1a ID1a
5 Name2a Address2a ID2a
As you said you can compute the combinations of values taking chunks of size step
from df1_n
:
fina_df = pd.DataFrame()
step=2
for i in range(0,df1_n.shape[0],step):
new_df = pd.DataFrame(product(df1_n.iloc[i:i+step,0], df2['Name']), columns=["Name1","Name2"])
new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x[0],x[1]), axis=1)
fina_df = pd.concat([fina_df, new_df], axis=0).reset_index(drop=True)
print(final_df)
Output:
Name1 Name2 LevScore
0 Name1a Name1b 1
1 Name1a Name2b 2
2 Name2a Name1b 2
3 Name2a Name2b 1
4 Name1a Name1b 1
5 Name1a Name2b 2
6 Name2a Name1b 2
7 Name2a Name2b 1
8 Name1a Name1b 1
9 Name1a Name2b 2
10 Name2a Name1b 2
11 Name2a Name2b 1
For your situation change 2 to 300 or 500. This should avoid to fill up your entire RAM, let me know if it works!