I'm puzzled about how the cosine metric works in sklearn's clustering algorithms.
For example, DBSCAN has a parameter eps, which specifies the maximum distance when clustering. However, a bigger cosine similarity means two vectors are closer, which is just the opposite of our notion of distance.
I found that there are cosine_similarity and cosine_distance (just 1 - cos()) in sklearn's pairwise metrics, and when we specify the metric as cosine we use cosine_similarity.
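The relationship between the two pairwise metrics can be checked directly; a minimal sketch using sklearn.metrics.pairwise, which provides both cosine_similarity and cosine_distances:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

samples = np.array([[1, 0], [0, 1], [1, 1], [2, 2]])

sim = cosine_similarity(samples)   # sim[i, j] = cos(samples[i], samples[j])
dist = cosine_distances(samples)   # dist[i, j] = 1 - cos(samples[i], samples[j])

# The two matrices sum to 1 element-wise, confirming distance = 1 - similarity.
print(np.allclose(sim + dist, 1.0))  # True
```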
So, when clustering, how does DBSCAN compare the cosine similarity against eps to decide whether two vectors get the same label?
An example:
import numpy as np
from sklearn.cluster import DBSCAN

samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
clf = DBSCAN(metric='cosine', eps=0.1)
result = clf.fit_predict(samples)
print(result)
It outputs [-1, -1, -1, -1], which I took to mean that these four points are in the same cluster.
However, the cosine similarity of the pair [1, 1], [2, 2] is 1, and that of the pair [1, 1], [1, 0] is √2/2 ≈ 0.707, both much bigger than an eps of 0.1, so why did DBSCAN cluster them into the same cluster?
Edit: thanks to @Stanislas Morbieu's answer, I finally understand that the cosine metric means cosine_distance, which is 1 - cosine.
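Under that reading, the pairwise cosine distances between the sample points make the behaviour concrete; a minimal sketch, with cosine_distances from sklearn.metrics.pairwise:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
D = cosine_distances(samples)

# Distance between [1, 1] and [2, 2]: the vectors are parallel, so 1 - 1 = 0.
print(round(D[2, 3], 3))  # 0.0
# Distance between [1, 1] and [1, 0]: 1 - cos(45°) = 1 - √2/2 ≈ 0.293.
print(round(D[2, 0], 3))  # 0.293
```

Only the pair [1, 1], [2, 2] lies within a distance of eps = 0.1 of each other.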
The implementation of DBSCAN in scikit-learn relies on NearestNeighbors (see the scikit-learn implementation of DBSCAN).
Here is an example to see how it works with the cosine metric:
import numpy as np
from sklearn.neighbors import NearestNeighbors
samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
neigh = NearestNeighbors(radius=0.1, metric='cosine')
neigh.fit(samples)
rng = neigh.radius_neighbors([[1, 1]])
print([samples[i] for i in rng[1][0]])
It outputs [[1, 1], [2, 2]], i.e. the points which are closest to [1, 1] within a radius of 0.1.
So points which have a cosine distance smaller than eps
in DBSCAN tend to be in the same cluster.
The parameter min_samples of DBSCAN plays an important role here. Since by default it is set to 5, and there are only four samples, no point can be considered a core point.
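This is easy to verify by inspecting core_sample_indices_ after fitting; a minimal sketch using the default min_samples=5:

```python
from sklearn.cluster import DBSCAN

samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
clf = DBSCAN(metric='cosine', eps=0.1)  # min_samples defaults to 5
clf.fit(samples)

# With only 4 samples, no eps-neighbourhood can contain 5 points,
# so there are no core points and every point is labelled -1 (noise).
print(clf.core_sample_indices_)  # []
print(clf.labels_)               # [-1 -1 -1 -1]
```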
Setting it to 1
, the example code:
import numpy as np
from sklearn.cluster import DBSCAN
samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
clf = DBSCAN(metric='cosine', eps=0.1, min_samples=1)
result = clf.fit_predict(samples)
print(result)
outputs [0 1 2 2], which means that [1, 1] and [2, 2] are in the same cluster (numbered 2).
By the way, the output [-1, -1, -1, -1] doesn't mean that the points are in the same cluster, but that all points are noise, i.e. they belong to no cluster at all.
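Accordingly, a common way to count the actual clusters is to exclude the -1 label; a minimal sketch:

```python
from sklearn.cluster import DBSCAN

samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
labels = DBSCAN(metric='cosine', eps=0.1).fit_predict(samples)

# -1 is the noise label, not a cluster id, so it must not be counted.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(n_clusters, n_noise)  # 0 4
```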