python, scikit-learn, knn

Sklearn: Nearest Neighbour with String Values and Custom Metric


I have data that looks like the following (all values are strings):

>>> all_states[0:3]
[['A','B','Empty'],
 ['A', 'B', 'Empty'],
 ['C', 'D', 'Empty']]

I want to use a custom distance metric:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mydist(x, y):
    return 1  # placeholder distance
neigh = NearestNeighbors(n_neighbors=5, metric=mydist)

However, when I call

neigh.fit(np.array(all_states))

I get the error

ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'

I know that I can use OneHotEncoder or LabelEncoder, but since I have my own distance metric, can I do this without encoding the data?


Solution

  • The help page for NearestNeighbors describes the metric parameter:

    metric : str or callable, default=’minkowski’

    The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of DistanceMetric for a list of available metrics. If metric is “precomputed”, X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors.

    So you can compute the pairwise distances yourself with scipy's pdist, convert the result to a square matrix with squareform, and pass that matrix to fit with metric="precomputed":

    all_states = [['A','B','Empty'],
     ['A', 'B', 'Empty'],
     ['C', 'D', 'Empty']]
    
    from scipy.spatial.distance import pdist,squareform
    from sklearn.neighbors import NearestNeighbors
    
    dm = squareform(pdist(all_states, mydist))
    dm
    
    array([[0., 1., 1.],
           [1., 0., 1.],
           [1., 1., 0.]])
    
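    # Note: kneighbors() requires n_neighbors to be no larger than the number
    # of fitted rows (three in this snippet), so fit on the full dataset.
    neigh = NearestNeighbors(n_neighbors=5, metric="precomputed")
    neigh.fit(dm)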
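
The precomputed approach also covers querying rows that were not part of the fit. Below is a minimal sketch under some assumptions that are not in the question: a hypothetical metric (the number of positions at which two rows differ), an extra fourth training row, and a made-up query row. `pairwise_mismatch` is an illustrative helper written for this sketch, not a scikit-learn or SciPy function. With metric="precomputed", kneighbors expects a matrix of shape (n_queries, n_fitted_rows) that holds the query-to-training distances:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def pairwise_mismatch(a, b):
    """Hypothetical string metric: count positions where rows differ.

    Returns a float matrix of shape (len(a), len(b)).
    """
    # Broadcasting compares every row of `a` with every row of `b`.
    return (a[:, None, :] != b[None, :, :]).sum(axis=2).astype(float)


# Assumed sample data: the question's rows plus one extra row.
all_states = np.array([['A', 'B', 'Empty'],
                       ['A', 'B', 'Empty'],
                       ['C', 'D', 'Empty'],
                       ['A', 'D', 'Empty']])

# Square (n_samples, n_samples) distance matrix used for fitting.
dm = pairwise_mismatch(all_states, all_states)

neigh = NearestNeighbors(n_neighbors=2, metric="precomputed")
neigh.fit(dm)

# Distances from each query row to every fitted row: shape (1, 4).
queries = np.array([['C', 'D', 'Full']])
dq = pairwise_mismatch(queries, all_states)

distances, indices = neigh.kneighbors(dq)
print(indices)    # positions of the nearest rows in all_states
print(distances)  # the corresponding precomputed distances
```

The strings never reach scikit-learn in this sketch; it only sees numeric distance matrices, which is why no encoder is needed. The trade-off is memory: the fit matrix grows quadratically with the number of rows.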