python scikit-learn dbscan

Multiple eps values in sklearn DBSCAN


I want to use the DBSCAN implementation from sklearn. It lets you use a custom distance metric, but only a single eps value. What I want is the following:

Let's say my points have 3 features each, so each point can be considered a numpy array of the form p=np.array([p1,p2,p3]). Two points p and q are neighbors if np.abs(p1-q1) < eps1 and np.abs(p2-q2) < eps2 and np.abs(p3-q3) < eps3. Usually, one would use d(p,q)<eps, where d(,) is a metric and eps a threshold.

Is there an easy way to implement this in sklearn?


Solution

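  • You can scale each feature by its own eps, and then use the maximum norm (Chebyshev distance) with eps=1. After scaling, the largest per-feature difference is at most 1 exactly when every difference is within its own threshold.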

    p = p * [1/eps1, 1/eps2, 1/eps3]
    
    c = sklearn.cluster.DBSCAN(eps=1, metric="chebyshev", ...)
    

    Note that DBSCAN uses <= not <, so points at a distance of exactly eps still count as neighbors.

    Alternatively, precompute a binary "distance" matrix, where the distance is 0 if the three conditions hold and 1 otherwise, and pass it with metric="precomputed" and any eps between 0 and 1. But that needs O(n²) memory.
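Both suggestions can be sketched end to end as follows. This is a minimal illustration, not a definitive implementation: the helper names `dbscan_scaled` and `dbscan_precomputed`, the `min_samples` default and the choice of eps=0.5 for the binary matrix are assumptions made for the example. The second helper uses `<=` so that it matches the scaled variant; switching it to `<` should reproduce the strict inequality from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def dbscan_scaled(X, eps_per_feature, min_samples=5):
    """Scale each feature by its eps, then cluster with the Chebyshev metric."""
    X = np.asarray(X, dtype=float)
    eps_per_feature = np.asarray(eps_per_feature, dtype=float)
    # After scaling, max_i |p_i - q_i| / eps_i <= 1 holds exactly when
    # every per-feature difference is within its own threshold.
    scaled = X / eps_per_feature
    model = DBSCAN(eps=1.0, metric="chebyshev", min_samples=min_samples)
    return model.fit_predict(scaled)


def dbscan_precomputed(X, eps_per_feature, min_samples=5):
    """Build a binary 0/1 "distance" matrix and pass it as precomputed."""
    X = np.asarray(X, dtype=float)
    eps_per_feature = np.asarray(eps_per_feature, dtype=float)
    # Pairwise absolute differences, shape (n, n, n_features): O(n^2) memory.
    diffs = np.abs(X[:, None, :] - X[None, :, :])
    # Assumption: "<=" to mirror DBSCAN's own comparison; use "<" if preferred.
    is_neighbor = np.all(diffs <= eps_per_feature, axis=-1)
    dist = (~is_neighbor).astype(float)  # 0 for neighbors, 1 otherwise
    # Any eps strictly between 0 and 1 separates the two values.
    model = DBSCAN(eps=0.5, metric="precomputed", min_samples=min_samples)
    return model.fit_predict(dist)
```

The scaled variant lets sklearn use its tree-based neighbor search and avoids the full n × n matrix, so it is probably the better default; the precomputed variant is mainly useful when the neighbor rule cannot be expressed as a scaled metric.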