python matrix scipy distance levenshtein-distance

Return Similarity Matrix From Two Variable-length Arrays of Strings (scipy option?)


Say I have two arrays:

import numpy as np
arr1 = np.array(['faucet', 'faucets', 'bath', 'parts', 'bathroom'])
arr2 = np.array(['faucett', 'faucetd', 'bth', 'kichen'])

and I want to compute the similarity of the strings in arr2 to the strings in arr1.

arr1 is an array of correctly spelled words.

arr2 is an array of words not recognized in a dictionary of words.

I want to return a matrix which will then be turned into a pandas DataFrame.

My current solution (credit):

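import pandas as pd
from scipy.spatial.distance import pdist, squareform
from Levenshtein import ratio

# Stack both arrays into one column so pdist compares every word with every other word
arr3 = np.concatenate((arr1, arr2)).reshape(-1, 1)
# squareform expands the condensed result into a square matrix with a zero diagonal
matrix = squareform(pdist(arr3, lambda x, y: ratio(x[0], y[0])))
df = pd.DataFrame(matrix, index=arr3.ravel(), columns=arr3.ravel())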

Output:

            faucet   faucets      bath     parts  bathroom   faucett  \
faucet    0.000000  0.923077  0.400000  0.363636  0.285714  0.923077   
faucets   0.923077  0.000000  0.363636  0.500000  0.266667  0.857143   
bath      0.400000  0.363636  0.000000  0.444444  0.666667  0.363636   
parts     0.363636  0.500000  0.444444  0.000000  0.307692  0.333333   
bathroom  0.285714  0.266667  0.666667  0.307692  0.000000  0.266667   
faucett   0.923077  0.857143  0.363636  0.333333  0.266667  0.000000   
faucetd   0.923077  0.857143  0.363636  0.333333  0.266667  0.857143   
bth       0.222222  0.200000  0.857143  0.250000  0.545455  0.200000   
kichen    0.333333  0.307692  0.200000  0.000000  0.142857  0.307692   

           faucetd       bth    kichen  
faucet    0.923077  0.222222  0.333333  
faucets   0.857143  0.200000  0.307692  
bath      0.363636  0.857143  0.200000  
parts     0.333333  0.250000  0.000000  
bathroom  0.266667  0.545455  0.142857  
faucett   0.857143  0.200000  0.307692  
faucetd   0.000000  0.200000  0.307692  
bth       0.200000  0.000000  0.222222  
kichen    0.307692  0.222222  0.000000

The problem with this solution: I waste time computing pairwise ratios between words I already know are correctly spelled, and between the misspelled words themselves.
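To put a rough number on that overhead for the two arrays above: pdist on the stacked array evaluates every unordered pair of the 9 words, while only the cross pairs between arr2 and arr1 are wanted. A quick check (the variable names here are illustrative only):

```python
from math import comb

n_known = 5      # len(arr1): correctly spelled words
n_unknown = 4    # len(arr2): unrecognized words

# pdist evaluates one ratio per unordered pair of the stacked words
pairs_computed = comb(n_known + n_unknown, 2)

# only the arr2-versus-arr1 pairs end up in the desired result
pairs_needed = n_known * n_unknown

print(pairs_computed, pairs_needed)
```

The gap widens quickly when the list of known words is much longer than the list of unknown ones, because the known-versus-known pairs dominate.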

What I'd like is a function that takes arr1 and arr2 (which can be different lengths!) and returns a matrix (not necessarily square) of the ratios.

The result would look like this (without the computational overhead):

>>> df.drop(index=arr1, columns=arr2)

           faucet   faucets      bath     parts  bathroom
faucett  0.923077  0.857143  0.363636  0.333333  0.266667
faucetd  0.923077  0.857143  0.363636  0.333333  0.266667
bth      0.222222  0.200000  0.857143  0.250000  0.545455
kichen   0.333333  0.307692  0.200000  0.000000  0.142857

Solution

  • I think you're looking for cdist:

    import pandas as pd
    import numpy as np
    from scipy.spatial.distance import cdist
    from Levenshtein import ratio
    
    arr1 = np.array(['faucet', 'faucets', 'bath', 'parts', 'bathroom'])
    arr2 = np.array(['faucett', 'faucetd', 'bth', 'kichen'])
    
    matrix = cdist(arr2.reshape(-1, 1), arr1.reshape(-1, 1), lambda x, y: ratio(x[0], y[0]))
    df = pd.DataFrame(data=matrix, index=arr2, columns=arr1)
    

    Result:

               faucet   faucets      bath     parts  bathroom
    faucett  0.923077  0.857143  0.363636  0.333333  0.266667
    faucetd  0.923077  0.857143  0.363636  0.333333  0.266667
    bth      0.222222  0.200000  0.857143  0.250000  0.545455
    kichen   0.333333  0.307692  0.200000  0.000000  0.142857
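For intuition, when cdist is given a Python callable it calls that function once for each (row, column) pair. The dependency-free sketch below does the same thing with plain lists. It assumes that Levenshtein's ratio is the normalized similarity 2 * LCS / (len(a) + len(b)), which matches the values in the table above but is an assumption about the library, not its actual implementation. The names lcs_length, indel_ratio and ratio_matrix are hypothetical helpers.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Assumed stand-in for Levenshtein.ratio: 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    return 2 * lcs_length(a, b) / total if total else 1.0


def ratio_matrix(rows, cols, scorer=indel_ratio):
    """Rectangular matrix with one scorer call per (row, col) pair."""
    return [[scorer(r, c) for c in cols] for r in rows]


arr1 = ['faucet', 'faucets', 'bath', 'parts', 'bathroom']
arr2 = ['faucett', 'faucetd', 'bth', 'kichen']

# 4 rows (arr2) by 5 columns (arr1); no arr1-vs-arr1 or arr2-vs-arr2 pairs are scored
matrix = ratio_matrix(arr2, arr1)
for word, row in zip(arr2, matrix):
    print(word, [round(value, 6) for value in row])
```

Passing the nested list to pd.DataFrame(matrix, index=arr2, columns=arr1) should give the same labeled table as the cdist version. The cdist call is still the more compact choice when SciPy is already in use.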