
How does the `cosine` metric work in sklearn's clustering algorithms?


I'm puzzled about how the cosine metric works in sklearn's clustering algorithms.

For example, DBSCAN has a parameter eps, which specifies the maximum distance between two samples when clustering. However, a bigger cosine similarity means two vectors are closer, which is the opposite of our usual concept of distance.

I found that there are cosine_similarity and cosine_distances (just 1 - cos()) in sklearn.metrics.pairwise, and I believed that when we specify the metric as cosine, cosine_similarity is used.
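To make the relationship between the two pairwise helpers concrete, here is a quick sketch (using the same sample points as below) showing that sklearn's cosine_distances is exactly 1 minus cosine_similarity:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

X = np.array([[1, 0], [0, 1], [1, 1], [2, 2]])

sim = cosine_similarity(X)    # pairwise cosine similarities, in [-1, 1]
dist = cosine_distances(X)    # pairwise cosine distances, in [0, 2]

# cosine distance is defined as 1 - cosine similarity
print(np.allclose(dist, 1 - sim))  # True
```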

So, when clustering, how does DBSCAN compare the cosine similarity with the parameter eps to decide whether two vectors get the same label?

An example:

import numpy as np
from sklearn.cluster import DBSCAN

samples = [[1, 0], [0, 1], [1, 1], [2, 2]]

clf = DBSCAN(metric='cosine', eps=0.1)

result = clf.fit_predict(samples)

print(result)

It outputs [-1, -1, -1, -1], which I took to mean that these four points are in the same cluster.

However,

  • for the points pair [1, 1] and [2, 2]:

    1. the cosine similarity is 4/4 = 1,
    2. so the cosine distance is 1 - 1 = 0, and they end up in the same cluster;
  • for the points pair [1, 1] and [1, 0]:

    1. the cosine similarity is 1/sqrt(2),
    2. so the cosine distance is 1 - 1/sqrt(2) ≈ 0.2929, which is bigger than our eps of 0.1, so why did DBSCAN put them into the same cluster?

Thanks to @Stanislas Morbieu's answer, I finally understand that the cosine metric means cosine_distance, which is 1 - cosine similarity.


Solution

  • The implementation of DBSCAN in scikit-learn relies on NearestNeighbors (see the implementation of DBSCAN).

    Here is an example to see how it works with cosine metric:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    
    samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
    neigh = NearestNeighbors(radius=0.1, metric='cosine')
    neigh.fit(samples) 
    
    rng = neigh.radius_neighbors([[1, 1]])
    print([samples[i] for i in rng[1][0]])
    

    It outputs [[1, 1], [2, 2]], i.e. the points within a radius of 0.1 of [1, 1].

    So points whose cosine distance is smaller than eps end up in the same DBSCAN neighborhood.

    The parameter min_samples of DBSCAN plays an important role here. Since it is set to 5 by default, no point can be considered a core point. Setting it to 1, the example code:

    import numpy as np
    from sklearn.cluster import DBSCAN
    
    samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
    
    clf = DBSCAN(metric='cosine', eps=0.1, min_samples=1)
    
    result = clf.fit_predict(samples)
    
    print(result)
    

    outputs [0 1 2 2] which means that [1, 1] and [2, 2] are in the same cluster (numbered 2).

    By the way, the output [-1, -1, -1, -1] doesn't mean that the points are in the same cluster, but that all points are noise, i.e. they belong to no cluster.
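To confirm this, a quick sketch with the original parameters: with the default min_samples=5, the fitted model has no core points at all (core_sample_indices_ is empty), so every label is -1, the noise marker:

```python
import numpy as np
from sklearn.cluster import DBSCAN

samples = [[1, 0], [0, 1], [1, 1], [2, 2]]

# default min_samples=5: no point has 5 neighbors within eps,
# so there are no core points and every point is labelled noise
clf = DBSCAN(metric='cosine', eps=0.1)
clf.fit(samples)

print(clf.labels_)               # [-1 -1 -1 -1]
print(clf.core_sample_indices_)  # [] -- no core points at all
```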