I have data that looks like the following (all values are strings):
>>> all_states[0:3]
[['A','B','Empty'],
['A', 'B', 'Empty'],
['C', 'D', 'Empty']]
I want to use a custom distance metric:
def mydist(x, y):
    return 1
neigh = NearestNeighbors(n_neighbors=5, metric=mydist)
However, when I call
neigh.fit(np.array(all_states))
I get the error
ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'
I know that I can use the OneHotEncoder or the LabelEncoder, but can I also do this without encoding the data, since I have my own distance metric?
On the help page, the metric parameter is described as follows:
metric : str or callable, default=’minkowski’
The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of DistanceMetric for a list of available metrics. If metric is “precomputed”, X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors.
You can compute the pairwise distances with pdist and convert the result to a square matrix with squareform, as required for the precomputed input:
all_states = [['A','B','Empty'],
['A', 'B', 'Empty'],
['C', 'D', 'Empty']]
from scipy.spatial.distance import pdist,squareform
from sklearn.neighbors import NearestNeighbors
dm = squareform(pdist(all_states, mydist))
dm
array([[0., 1., 1.],
[1., 0., 1.],
[1., 1., 0.]])
neigh = NearestNeighbors(n_neighbors=5, metric="precomputed")
neigh.fit(dm)
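Since mydist above always returns 1, every pair of rows ends up equally far apart. As a rough sketch of what a more informative metric could look like, the block below counts the positions where two rows differ and builds the square matrix in plain Python. The names mismatch_dist and distance_matrix are hypothetical helpers introduced here for illustration; they are not part of scipy or scikit-learn.

```python
# Hypothetical metric (an assumption, not from the question):
# the number of positions at which two rows hold different strings.
def mismatch_dist(x, y):
    return sum(a != b for a, b in zip(x, y))


def distance_matrix(rows, dist):
    """Build a symmetric, square distance matrix as a list of lists."""
    n = len(rows)
    dm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Fill both halves so the matrix is symmetric with a zero diagonal.
            dm[i][j] = dm[j][i] = float(dist(rows[i], rows[j]))
    return dm


all_states = [['A', 'B', 'Empty'],
              ['A', 'B', 'Empty'],
              ['C', 'D', 'Empty']]

dm = distance_matrix(all_states, mismatch_dist)
```

A matrix built this way should be usable with metric="precomputed" in the same way as the squareform result. Note that with only three rows, querying with n_neighbors=5 will fail, because n_neighbors cannot exceed the number of fitted samples.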