I have two dataframes which contain the same string column (hostname) and I want to compute the Levenshtein distance between all possible combinations of hostnames pairs between both dataframes and put the results in a third dataframe where I keep the distance for each possible combination but also both indexes of this combination.
For example, let's say I have these two dataframes:
Index Hostname
85608 dlt-rly-tracker-3.datto.com
9378 lnv7bc4241e2.1528.ozvision.ozsn.net
22791 dlt-rly-tracker-1.datto.com
88922 pw-file
94560 lnv7bc4241e2.1528.ozvision.ozsn.net
13245 -
63604 pw-file
435839 pw-file
95473 -
13856 pw-file
210705 pw-file
30046 -
106917 dlt-rly-tracker-2.datto.com
415925 pw-file
170471 pw-file
73971 -
86885 dlt-rly-tracker-3.datto.com
162764 pw-file
74791 pw-file
and the second dataframe:
Index Hostname
93358 device.dattobackup.com
34067 dlt-rly-tracker-5.datto.com
18083 46.104.89.54.in-addr.arpa
96798 pw-file
130940 pw-file
31476 lnv7bc4241e2.1528.ozvision.ozsn.net
149723 pw-file
52901 -
308834 lnv7bc4241e2.1528.ozvision.ozsn.net
24196 pw-file
69038 -
244454 lnv7bc4241e2.1528.ozvision.ozsn.net
2867 -
45549 daisy.ubuntu.com
334378 pw-file
86006 46.104.89.54.in-addr.arpa
430257 pw-file
86150 46.104.89.54.in-addr.arpa
65189 pw-file
What I want to do is to take the first value of hostname (dlt-rly-tracker-3.datto.com) and compute the levenshtein distance with the all values of hostnames from the second dataframe (one by one). Store the results at the end of this process in a new dataframe that will look similar to the following:
Indexes Distance Hostnames
85608-93358 23 dlt-rly-tracker-3.datto.com,device.dattobackup.com
85608-34067 60 dlt-rly-tracker-3.datto.com,dlt-rly-tracker-5.datto.com
I really appreciate any help to solve my problem. Thanks.
The solution below will loop through both data frames and create a new dictionary with the data you want. You should then convert this dictionary to a data frame. Let me know if this helps!
dist = {}
for rowname, row in df.iterrows():
for rowname1, row1 in df1.iterrows():
L = Levenstein(row.Hostname, row1.Hostname)
dist.update( {rowname+’-‘+rowname1 : (L, row.Hostname+’,’+row1.Hostname} )