Search code examples
pythonpandasdataframesimilaritylevenshtein-distance

Compute Levenshtein Distance between two String Columns from two different dataframes


I have two dataframes which contain the same string column (hostname) and I want to compute the Levenshtein distance between all possible combinations of hostnames pairs between both dataframes and put the results in a third dataframe where I keep the distance for each possible combination but also both indexes of this combination.

For example, let's say I have these two dataframes:

Index      Hostname
85608             dlt-rly-tracker-3.datto.com
9378      lnv7bc4241e2.1528.ozvision.ozsn.net
22791             dlt-rly-tracker-1.datto.com
88922                                 pw-file
94560     lnv7bc4241e2.1528.ozvision.ozsn.net
13245                                       -
63604                                 pw-file
435839                                pw-file
95473                                       -
13856                                 pw-file
210705                                pw-file
30046                                       -
106917            dlt-rly-tracker-2.datto.com
415925                                pw-file
170471                                pw-file
73971                                       -
86885             dlt-rly-tracker-3.datto.com
162764                                pw-file
74791                                 pw-file

and the second dataframe:

Index     Hostname
93358                  device.dattobackup.com
34067             dlt-rly-tracker-5.datto.com
18083               46.104.89.54.in-addr.arpa
96798                                 pw-file
130940                                pw-file
31476     lnv7bc4241e2.1528.ozvision.ozsn.net
149723                                pw-file
52901                                       -
308834    lnv7bc4241e2.1528.ozvision.ozsn.net
24196                                 pw-file
69038                                       -
244454    lnv7bc4241e2.1528.ozvision.ozsn.net
2867                                        -
45549                        daisy.ubuntu.com
334378                                pw-file
86006               46.104.89.54.in-addr.arpa
430257                                pw-file
86150               46.104.89.54.in-addr.arpa
65189                                 pw-file

What I want to do is to take the first value of hostname (dlt-rly-tracker-3.datto.com) and compute the levenshtein distance with the all values of hostnames from the second dataframe (one by one). Store the results at the end of this process in a new dataframe that will look similar to the following:

Indexes         Distance    Hostnames
85608-93358     23          dlt-rly-tracker-3.datto.com,device.dattobackup.com
85608-34067     60          dlt-rly-tracker-3.datto.com,dlt-rly-tracker-5.datto.com

I really appreciate any help to solve my problem. Thanks.


Solution

  • The solution below will loop through both data frames and create a new dictionary with the data you want. You should then convert this dictionary to a data frame. Let me know if this helps!

     dist = {}
     for rowname, row in df.iterrows(): 
          for rowname1, row1 in df1.iterrows(): 
                L = Levenstein(row.Hostname, row1.Hostname)
                dist.update( {rowname+’-‘+rowname1 : (L, row.Hostname+’,’+row1.Hostname} )