
How to define a range of values for the eps parameter of sklearn.cluster.DBSCAN?


I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors (from "the" origin) are parallel or almost parallel).

The issue:

eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;

but

sklearn.metrics.pairwise.cosine_similarity returns values between -1 and 1, and I want DBSCAN to consider two points to be neighbours if the cosine similarity between them is, say, between 0.75 and 1, i.e. greater than or equal to 0.75.

I see two possible solutions:

  1. Pass a range of values to the eps parameter of DBSCAN, e.g. eps=[0.75, 1].

  2. Pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarity matrix returned by sklearn.metrics.pairwise.cosine_similarity.

I do not know how to implement either of these.

Any guidance would be appreciated!


Solution

  • DBSCAN has a metric keyword argument. Docstring:

    metric : string, or callable
        The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only "nonzero" elements may be considered neighbors for DBSCAN.

    So probably the easiest thing to do is to precompute the cosine similarity matrix, transform it into a distance matrix that fits your bespoke criterion (probably something like D = np.abs(np.abs(CS) - 1), where CS is your cosine similarity matrix), then set metric to "precomputed" and pass D in for X, i.e. in place of the data. With this transform, a cosine similarity of at least 0.75 corresponds to a distance of at most 0.25, so the threshold you want is eps=0.25.

    For example:

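    #!/usr/bin/env python

    import numpy as np

    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.cluster import DBSCAN

    total_samples = 1000
    dimensionality = 3
    points = np.random.rand(total_samples, dimensionality)

    # cosine_similarity returns similarities in [-1, 1], not distances
    cosine_sim = cosine_similarity(points)

    # Pick ONE of the two options below; as written, option 2 overwrites option 1.

    # option 1) vectors are close to each other if they are parallel
    #           (pointing in the same or in opposite directions)
    bespoke_distance = np.abs(np.abs(cosine_sim) - 1)

    # option 2) vectors are close to each other if they point in the same direction
    bespoke_distance = np.abs(cosine_sim - 1)

    # eps=0.25 on the distance corresponds to a cosine similarity of at least 0.75
    results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)
    labels = results.labels_  # cluster index per point; -1 marks noise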
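
If option 2 is all you need, a possible shortcut is to let DBSCAN compute the cosine distance itself by passing metric="cosine", assuming your scikit-learn version accepts that metric for DBSCAN. Cosine distance is 1 minus cosine similarity, so eps=0.25 again means "similarity of at least 0.75". The sketch below uses six made-up 2-D points (purely hypothetical data) and min_samples=2, chosen only so that the tiny example forms clusters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: three vectors near the x-axis, three near the y-axis.
points = np.array([
    [1.0, 0.0],
    [2.0, 0.1],
    [3.0, 0.2],
    [0.0, 1.0],
    [0.1, 2.0],
    [0.2, 3.0],
])

# Cosine distance = 1 - cosine similarity, so eps=0.25 keeps pairs
# whose cosine similarity is at least 0.75.
# min_samples=2 is an assumption made for this small example only.
labels = DBSCAN(metric="cosine", eps=0.25, min_samples=2).fit_predict(points)

print(labels)  # one cluster per direction: [0 0 0 1 1 1]
```

This does not cover option 1 (treating antiparallel vectors as neighbours); for that you would still need the precomputed matrix above.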