Tags: python, scikit-learn, dbscan

DBSCAN - Best way to find the Eps and MinPts for geospatial data (coordinates)


Question: What is the best way to find the Eps and MinPts parameters for the DBSCAN algorithm?

Problem: The goal is to find the locations (clusters) based on the input coordinates. The algorithm should identify the most visited areas and return them as clusters.

Approach:

I defined the epsilon (EPS) parameter as 1.5 km, converted to radians so it can be used by the DBSCAN algorithm: epsilon = 1.5 / 6371.0088 (reference for the 1.5 km value: https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/).

If I set MinPts to a low value (e.g. MinPts = 5, which produces about 2000 clusters), DBSCAN produces far too many clusters; I want to limit the number and size of the clusters to an acceptable value. I use the haversine metric with the ball tree algorithm to calculate great-circle distances between points.
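
As a rough sketch of this setup (the coordinate values below are placeholders, and min_samples=5 simply mirrors the MinPts = 5 mentioned above), the haversine/ball-tree combination in scikit-learn looks like this:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Placeholder input: an (n, 2) array of [latitude, longitude] pairs in degrees.
    coords_deg = np.array([
        [40.7812, -73.9665],
        [40.7484, -73.9857],
        [48.8584,   2.2945],
    ])

    kms_per_radian = 6371.0088          # mean Earth radius in km
    epsilon = 1.5 / kms_per_radian      # 1.5 km expressed in radians

    # Haversine expects [lat, lon] in radians; the ball tree supports that metric.
    db = DBSCAN(eps=epsilon, min_samples=5, algorithm='ball_tree',
                metric='haversine').fit(np.radians(coords_deg))
    labels = db.labels_                 # -1 marks noise points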

Suggestions:

  1. a knn approach to find EPS (a k-distance sketch follows this list);
  2. use domain knowledge to decide the best values for EPS and MinPts.
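
A common form of the knn suggestion is the sorted k-distance plot: compute each point's distance to its MinPts-th nearest neighbour (counting the point itself, which matches scikit-learn's min_samples semantics), sort those distances, and look for the elbow of the curve as a candidate EPS. A minimal sketch with toy random coordinates and an assumed min_pts = 5:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import NearestNeighbors

    # Toy data: 1000 random [lat, lon] points in a small box (stand-in for the real coordinates).
    rng = np.random.default_rng(0)
    coords_deg = rng.uniform([40.70, -74.02], [40.80, -73.93], size=(1000, 2))
    coords_rad = np.radians(coords_deg)

    min_pts = 5
    nn = NearestNeighbors(n_neighbors=min_pts, algorithm='ball_tree',
                          metric='haversine').fit(coords_rad)
    distances, _ = nn.kneighbors(coords_rad)
    k_dist = np.sort(distances[:, -1])    # distance to the min_pts-th neighbour, ascending

    plt.plot(k_dist * 6371.0088)          # convert radians back to km for readability
    plt.xlabel('points sorted by k-distance')
    plt.ylabel('distance to %dth nearest neighbour (km)' % min_pts)
    plt.show()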

Data: I'm using 160k coordinates, but the program should be capable of handling different data inputs.


Solution

  • As you may know, setting MinPts high will not only prevent small clusters from forming, but will also change the shape of larger clusters, as their outskirts will be considered outliers.

    Consider instead a third way to reduce the number of clusters: simply sort the clusters by descending size (number of points) and keep only the top 4 or 5 (see the sketch below). This way you won't be shown all the small clusters you're not interested in, and you can instead treat all of those points as noise.

    You're essentially using DBSCAN for something it's not meant for, namely to find the n largest clusters, but that's fine - you just need to "tweak the algorithm" to fit your use case.
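
    A minimal sketch of that top-n filtering step (the labels array below is a stand-in for labels_ from a fitted DBSCAN, and n_largest = 5 is just the cut-off suggested above):

        import numpy as np
        from collections import Counter

        labels = np.array([0, 0, 0, 1, 1, 2, -1, 2, 2, 2, 1])   # toy DBSCAN output, -1 = noise
        n_largest = 5

        # Count points per cluster (ignoring noise) and keep the n_largest cluster ids.
        cluster_sizes = Counter(labels[labels != -1])
        top_clusters = [cid for cid, _ in cluster_sizes.most_common(n_largest)]

        # Relabel everything outside the top clusters as noise (-1).
        filtered = np.where(np.isin(labels, top_clusters), labels, -1)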


    Update

    If you know the entire dataset and it will not change in the future, I would just tweak minPts manually, based on your knowledge.

    In scientific environments and with varying data sets, you consider the data as "generated from a stochastic process". However, that would mean that there is a chance - no matter how small - that there are minPts dogs in a remote forest somewhere at the same time, or minPts - 1 dogs in Central Park, where it's normally overcrowded.

    What I mean by that is that if you go down the scientific road, you need to find a balance between the deterministic value of minPts and the probabilistic distribution of the points in your data set.

    In my experience, it all comes down to whether or not you trust your knowledge, or would like to defer responsibility. In some government/scientific/large corporate positions, it's a safer choice to pin something on an algorithm than on gut feeling. In other situations, it's safe to use gut feeling.