Tags: python, parameters, sklearn-pandas, dbscan

How to find optimal parameters for DBSCAN?


Is there a tool that calculates optimal values of minpts and eps for the DBSCAN algorithm?

Currently I use the sklearn library to apply the DBSCAN algorithm:

from sklearn.cluster import DBSCAN

I have tried the algorithm with several values of minpts and eps, but I picked them without any calculation.


Solution

  • eps and minpts are both hyperparameters. No algorithm determines the perfect values for a given dataset. Instead, they must be tuned largely based on the problem you are trying to solve.

    Some ideas on how to optimize:

    minpts should be larger as the size of the dataset increases.

    eps is the radius of the neighborhood searched around each point, so it controls how close points must be to fall in the same cluster. To choose a value, we can use an elbow technique (similar to the one often used to pick k in K-Means clustering).

    1. Let k = the number of nearest neighbors.
    2. For that value of k, calculate for each point in the dataset the average distance to its k nearest neighbors (some packages have this function built in).
    3. Sort these average distances in increasing order and plot them, with each point's rank in the sorted order on the x axis and its average distance on the y axis.
    4. The resulting graph is increasing (because of the sort) and should be concave up. There should be a point where the rate of increase jumps drastically. This point is called the elbow point, and its y value is your candidate eps.
    5. Repeat with different values of k and compare the results.
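The steps above can be sketched with scikit-learn's NearestNeighbors. This is a minimal illustration, not a definitive recipe. The function names sorted_k_distances, crude_elbow and plot_k_distance are hypothetical, the make_blobs data is synthetic, and the elbow detector (the point farthest below the straight line joining the ends of the curve) is an assumed heuristic that is no substitute for looking at the plot.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors


def sorted_k_distances(X, k):
    """Mean distance from each point to its k nearest neighbors, sorted ascending."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Calling kneighbors() with no argument queries the training points
    # themselves and does not count a point as its own neighbor.
    distances, _ = nn.kneighbors()
    return np.sort(distances.mean(axis=1))


def crude_elbow(curve):
    """Assumed heuristic: index of the point farthest below the chord
    joining the first and last points of the sorted curve.
    Assumes the curve is not flat (curve[-1] > curve[0])."""
    n = len(curve)
    x = np.arange(n) / (n - 1)                        # rank scaled to [0, 1]
    y = (curve - curve[0]) / (curve[-1] - curve[0])   # distance scaled to [0, 1]
    idx = int(np.argmax(x - y))
    return idx, float(curve[idx])


def plot_k_distance(curve):
    """Plot rank on the x axis and mean k-NN distance on the y axis."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    plt.plot(curve)
    plt.xlabel("points sorted by mean k-NN distance")
    plt.ylabel("mean distance to k nearest neighbors")
    plt.show()


# Synthetic demo data (an assumption for illustration only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
curve = sorted_k_distances(X, k=4)
idx, eps_guess = crude_elbow(curve)
print(f"elbow at rank {idx}, candidate eps = {eps_guess:.3f}")
# plot_k_distance(curve)  # uncomment to inspect the elbow visually
```

Rerunning with a few different values of k and comparing the candidate eps values corresponds to step 5.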

    If there were a definite way to solve for optimal values, it would be widely documented. For now, all we can do is make our best calculated guess. Once again, the problem you are trying to solve may affect how you choose your elbow point, so it is important to understand that problem.
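Since the final choice is a judgment call, it can help to run DBSCAN with a few candidate values and compare how many clusters and noise points each produces. The sketch below assumes synthetic make_blobs data and arbitrary candidate eps values. The helper summarize_dbscan is a hypothetical name. Note that sklearn calls minpts min_samples and labels noise points -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs


def summarize_dbscan(X, eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    # sklearn marks noise with the label -1, which is not a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# Synthetic demo data and candidate values (assumptions for illustration).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
candidates = (0.2, 0.3, 0.5, 0.8)

results = {}
for eps in candidates:
    results[eps] = summarize_dbscan(X, eps=eps, min_samples=5)
    n_clusters, n_noise = results[eps]
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

Which row counts as "best" depends on the problem: a setting that leaves many points as noise may be fine for outlier detection but not for segmenting every observation.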