Sklearn: Nearest Neighbour with String Values and Custom Metric

I have data that looks like the following (all values are strings):

>>> all_states[0:3]
[['A','B','Empty'],
 ['A', 'B', 'Empty'],
 ['C', 'D', 'Empty']]

I want to use a custom distance metric

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mydist(x, y):
    return 1
neigh = NearestNeighbors(n_neighbors=5, metric=mydist)

However, when I call

neigh.fit(np.array(all_states))

I get the error

ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'

I know that I can use OneHotEncoder or LabelEncoder, but since I have my own distance metric, can I do this without encoding the data?

Solution

On the help page, the metric parameter is described as:

metric : str or callable, default=’minkowski’

The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of DistanceMetric for a list of available metrics. If metric is “precomputed”, X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors.

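So you can skip the encoding by precomputing the distances yourself: use pdist with your custom metric (mydist from the question) to get the pairwise distances, then squareform to turn them into the square matrix that fit expects: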

all_states = [['A','B','Empty'],
 ['A', 'B', 'Empty'],
 ['C', 'D', 'Empty']]

from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import NearestNeighbors

dm = squareform(pdist(all_states, mydist))
dm

array([[0., 1., 1.],
       [1., 0., 1.],
       [1., 1., 0.]])

neigh = NearestNeighbors(n_neighbors=5, metric="precomputed")  
neigh.fit(dm)
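
To actually query neighbours, here is a minimal end-to-end sketch of the same approach. The function mismatch_dist (number of positions where two rows differ) and the query row are hypothetical stand-ins, not part of the question. n_neighbors is set to 2 only because the toy data has three rows; kneighbors raises an error if you ask for more neighbours than there are fitted samples. For new points, the assumption is that you pass a precomputed matrix of shape (n_queries, n_fitted_samples), built here with cdist.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform
from sklearn.neighbors import NearestNeighbors

all_states = np.array([['A', 'B', 'Empty'],
                       ['A', 'B', 'Empty'],
                       ['C', 'D', 'Empty']])

def mismatch_dist(x, y):
    # Hypothetical metric: count the positions where the two rows differ.
    return float(sum(a != b for a, b in zip(x, y)))

# Square (3, 3) matrix of pairwise distances between the training rows.
dm = squareform(pdist(all_states, mismatch_dist))

neigh = NearestNeighbors(n_neighbors=2, metric="precomputed")
neigh.fit(dm)

# With no argument, the training points are queried against each other
# and each point is left out of its own neighbour list.
dist, idx = neigh.kneighbors()

# For a new row, pass its distances to every training row: shape (1, 3).
query = np.array([['A', 'D', 'Empty']])
dq = cdist(query, all_states, mismatch_dist)
q_dist, q_idx = neigh.kneighbors(dq)
```

The trade-off is that the full distance matrix grows quadratically with the number of rows, so this suits small to medium data sets.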