Say I have two arrays:
import numpy as np
arr1 = np.array(['faucet', 'faucets', 'bath', 'parts', 'bathroom'])
arr2 = np.array(['faucett', 'faucetd', 'bth', 'kichen'])
and I want to compute the similarity of the strings in arr2 to the strings in arr1.
arr1 is an array of correctly spelled words.
arr2 is an array of words not recognized in a dictionary of words.
I want to return a matrix which will then be turned into a pandas DataFrame.
My current solution (credit):
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from Levenshtein import ratio
arr3 = np.concatenate((arr1, arr2)).reshape(-1,1)
matrix = squareform(pdist(arr3, lambda x,y: ratio(x[0], y[0])))
df = pd.DataFrame(matrix, index=arr3.ravel(), columns=arr3.ravel())
            faucet   faucets      bath     parts  bathroom   faucett  \
faucet    0.000000  0.923077  0.400000  0.363636  0.285714  0.923077
faucets   0.923077  0.000000  0.363636  0.500000  0.266667  0.857143
bath      0.400000  0.363636  0.000000  0.444444  0.666667  0.363636
parts     0.363636  0.500000  0.444444  0.000000  0.307692  0.333333
bathroom  0.285714  0.266667  0.666667  0.307692  0.000000  0.266667
faucett   0.923077  0.857143  0.363636  0.333333  0.266667  0.000000
faucetd   0.923077  0.857143  0.363636  0.333333  0.266667  0.857143
bth       0.222222  0.200000  0.857143  0.250000  0.545455  0.200000
kichen    0.333333  0.307692  0.200000  0.000000  0.142857  0.307692

           faucetd       bth    kichen
faucet    0.923077  0.222222  0.333333
faucets   0.857143  0.200000  0.307692
bath      0.363636  0.857143  0.200000
parts     0.333333  0.250000  0.000000
bathroom  0.266667  0.545455  0.142857
faucett   0.857143  0.200000  0.307692
faucetd   0.000000  0.200000  0.307692
bth       0.200000  0.000000  0.222222
kichen    0.307692  0.222222  0.000000
The problem with this solution: I waste time computing pairwise distance ratios on words I already know are correctly spelled.
What I'd like is to hand a function arr1 and arr2 (which can be different lengths!) and output a matrix (not necessarily square) with the ratios.
The result would look like this (without the computational overhead):
>>> df.drop(index=arr1, columns=arr2)
           faucet   faucets      bath     parts  bathroom
faucett  0.923077  0.857143  0.363636  0.333333  0.266667
faucetd  0.923077  0.857143  0.363636  0.333333  0.266667
bth      0.222222  0.200000  0.857143  0.250000  0.545455
kichen   0.333333  0.307692  0.200000  0.000000  0.142857
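I think you're looking for cdist, which computes the distance between each pair of items drawn from two input collections, so only the arr2-versus-arr1 pairs are evaluated: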
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
from Levenshtein import ratio
arr1 = np.array(['faucet', 'faucets', 'bath', 'parts', 'bathroom'])
arr2 = np.array(['faucett', 'faucetd', 'bth', 'kichen'])
matrix = cdist(arr2.reshape(-1, 1), arr1.reshape(-1, 1), lambda x, y: ratio(x[0], y[0]))
df = pd.DataFrame(data=matrix, index=arr2, columns=arr1)
           faucet   faucets      bath     parts  bathroom
faucett  0.923077  0.857143  0.363636  0.333333  0.266667
faucetd  0.923077  0.857143  0.363636  0.333333  0.266667
bth      0.222222  0.200000  0.857143  0.250000  0.545455
kichen   0.333333  0.307692  0.200000  0.000000  0.142857
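
If you would rather not reshape the arrays into columns just to satisfy scipy, the same rectangular matrix can be built with a plain nested comprehension. The following is only a minimal sketch under a few assumptions: `similarity_matrix` and `difflib_ratio` are hypothetical helper names, and the default scorer is `difflib.SequenceMatcher` from the standard library, swapped in here so the snippet runs without extra packages. It is not `Levenshtein.ratio`, and its scores may differ for some pairs.

```python
from difflib import SequenceMatcher


def difflib_ratio(a, b):
    """Stand-in scorer from the standard library (NOT Levenshtein.ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def similarity_matrix(rows, cols, scorer=difflib_ratio):
    """Return a len(rows) x len(cols) nested list of scorer(row, col) values.

    Only row-versus-column pairs are scored, so no time is spent comparing
    the words within `cols` (or within `rows`) to each other.
    """
    return [[scorer(r, c) for c in cols] for r in rows]


arr1 = ['faucet', 'faucets', 'bath', 'parts', 'bathroom']  # correctly spelled
arr2 = ['faucett', 'faucetd', 'bth', 'kichen']             # unrecognized

# 4 rows (arr2) by 5 columns (arr1): 20 scorer calls in total.
matrix = similarity_matrix(arr2, arr1)
```

To get the Levenshtein-based scores shown above, pass `scorer=ratio` with `ratio` imported from `Levenshtein`, then wrap the result as before with `pd.DataFrame(matrix, index=arr2, columns=arr1)`.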