Search code examples
cluster-analysisdbscanelkiepsilon

ELKI DBSCAN LngLatDistanceFunction producing one cluster


I'm using Elki LngLatDistanceFunction to cluster Lon/lat points but it's only returning one cluster (was returning more clusters when I used Euclid distance). I tried multiple Epsilon values but I'm still getting one cluster.

    int minPts=20;
    double eps=10;
    ListParameterization params = new ListParameterization();
    params.addParameter(DBSCAN.DISTANCE_FUNCTION_ID, LngLatDistanceFunction.class);
    params.addParameter(DBSCAN.Parameterizer.MINPTS_ID, minPts);
    params.addParameter(DBSCAN.Parameterizer.EPSILON_ID, eps);

    params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dbcon);
    params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, RStarTreeFactory.class);
    params.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, SortTileRecursiveBulkSplit.class);
    params.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, 600);

    Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
    db.initialize();

    GeneralizedDBSCAN dbscan = ClassGenericsUtil.parameterizeOrAbort(GeneralizedDBSCAN.class, params);

Solution

  • The distance is in meters. Therefore, you need to choose epsilon such that some - but not all points - have more than minPts neighbors.

    You can use the KNNDistancesSampler class to estimate the parameter. It is not an automatic estimation. But you can plot the resuling distances, and check for a "knee" in this plot.

    Pay attention to the "noise" flag.

    • If you get a single cluster, and it is "noise", then epsilon is too small.
    • If you get a single cluster, and it is a "cluster" (not noise), then epsilon is too large.
    • If you get a single cluster, and it is "noise", then minPts may be too large.
    • If you get a single cluster, and it is a cluster, then minPts may be too small.

    For most applications, it is easier to fix minPts to 4, or 10, or 20; and then adjust the epsilon parameter as desired. For geographic applications like yours, it may be much easier to fix the epsilon parameter, and vary the minpts parameter instead. For example, you may know that a distance of less than 10000 meter indicates objects to be "neighbors".

    Algorithms such as OPTICS are also helpful to choose the parameter visually. (Use the MiniGUI!)