Search code examples
pythonscikit-learncluster-analysisdbscan

Using DBSCAN to find the most dense cluster?


Ive been looking at Geoff Boeing's excellent blog posts on DBSCAN. The page I'm most interested in is -

http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/

How can I amend this approach to return the center of the largest cluster(cluster center with the surrounded by the most lat/lng points)? Is there a density rating associated with the center point of each cluster?

The core dbscan -

db = DBSCAN(eps=.01, min_samples=1).fit(coordinates)
labels = db.labels_
num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
clusters = pd.Series([coordinates[labels == i] for i in xrange(num_clusters)])
print('Number of clusters: %d' % num_clusters)

Solution

  • Unfortunately, that blog post is wrong in a number of key points.

    1. Never use DBSCAN with min_samples=1. That is single-linkage clustering. If you want single-linkage, use single-linkage, not DBSCAN. Here, Leader clustering may be a good choice, too.

    2. Choose eps wisely. In his example, he chose eps so small, he mostly removed (near-) duplicates...

    3. DBSCAN clusters do not have a meaningful center. Because they can be non-convex.In particular, the center needs to take haversine distance into account, which he doesn't. The first version used the mean, the new version uses the point closest to the mean (but which may still be distorted, because the mean didn't take earth into account).

    4. On latitude, longitude you should be using great-circle distance during clustering already, not only afterwards. (Fixed in the blog by now).

    Point 3 above also answers your question: DBSCAN clusters may not have a meaningful center. The center can be outside the cluster.

    Since the original post, some points (in particular #4) have been improved. DBSCAN now actually uses haversine, and a ball tree index.