I'm working with DBSCAN libraries to extract clusters from a set of data. So far I've tested DBSCAN using the Apache Commons Math and WEKA libraries. (My question is not about which libraries offer DBSCAN implementations.)
As far as I understand, DBSCAN distinguishes three types of points (according to Wikipedia): core points, (density-)reachable points and outliers. My issue is that I need to extract the clusters together with their frontier points, i.e. their density-reachable points.
Do you know any DBSCAN library that allows me to extract the density-reachable points per cluster?
In the ELKI implementation, you can use the options
-algorithm clustering.gdbscan.GeneralizedDBSCAN -gdbscan.core-model
to get a cluster "model" containing only the core points of the cluster. The cluster members still include the border points (density-reachable, but not core). However, this needs more memory, so it is not enabled by default.
In this image, the inner convex hull of each cluster covers the core points only. For the green cluster, there are just two core points. For the noise points, there is obviously no nested cluster.
Note that DBSCAN clusters can be non-convex. This is why the green cluster can have core points inside the convex hull of the red cluster. Not every point inside the inner hull is a core point. There is even a noise point right inside the red cluster, and that is not an error: the data set is too sparse and has too many local density variations for this epsilon and minPts. No point in the vicinity of this noise point can be a core point, but every point that defines the inner convex hull is one for sure.
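To make the three point types concrete, here is a minimal, self-contained sketch in plain Java. It is not ELKI's implementation: it uses a brute-force neighbor search, Euclidean distance, and counts a point as its own neighbor, and all names (DbscanPointTypes, Result, isBorder) are made up for illustration. A border point that is reachable from two clusters is simply assigned to the first one that reaches it.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

class DbscanPointTypes {
    static final int NOISE = -1;

    /** Cluster label per point (NOISE if unassigned) plus a core flag per point. */
    static final class Result {
        final int[] label;
        final boolean[] core;

        Result(int[] label, boolean[] core) {
            this.label = label;
            this.core = core;
        }

        /** Border point: belongs to a cluster but is not a core point. */
        boolean isBorder(int i) {
            return label[i] != NOISE && !core[i];
        }
    }

    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return Math.sqrt(sum);
    }

    static Result dbscan(double[][] pts, double eps, int minPts) {
        int n = pts.length;
        // Step 1: a point is core if at least minPts points (itself included) lie within eps.
        boolean[] core = new boolean[n];
        for (int i = 0; i < n; i++) {
            int count = 0;
            for (int j = 0; j < n; j++) {
                if (dist(pts[i], pts[j]) <= eps) count++;
            }
            core[i] = count >= minPts;
        }
        // Step 2: grow clusters from core points; only core points are expanded further.
        int[] label = new int[n];
        Arrays.fill(label, NOISE);
        int nextCluster = 0;
        for (int s = 0; s < n; s++) {
            if (!core[s] || label[s] != NOISE) continue;
            int c = nextCluster++;
            label[s] = c;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty()) {
                int p = queue.poll();
                for (int q = 0; q < n; q++) {
                    if (label[q] != NOISE || dist(pts[p], pts[q]) > eps) continue;
                    label[q] = c;              // q is density-reachable from p
                    if (core[q]) queue.add(q); // border points join but do not expand
                }
            }
        }
        return new Result(label, core);
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {1, 0}, {2, 0}, {3, 0}, {10, 0}};
        Result r = dbscan(pts, 1.0, 3);
        for (int i = 0; i < pts.length; i++) {
            String type = r.core[i] ? "core" : r.isBorder(i) ? "border" : "noise";
            System.out.println("point " + i + ": cluster " + r.label[i] + ", " + type);
        }
    }
}
```

With these five points, epsilon 1.0 and minPts 3, the two middle points of the chain are core, the two chain ends are border points of the same cluster, and the isolated point is noise.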
The Cluster objects will provide you with a full list of points, not just the convex hull. The core points are accessible via the cluster's CoreObjectsModel. Only the visualization code uses the convex hulls, to avoid cluttering the image too much. Also, the default output writer currently does not output this information. You will need to use Java, and either write a custom ResultHandler to output the data as desired, or even do everything in ELKI.
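Once you have both the full member list of a cluster and its core points, the border points are just the set difference. A minimal sketch using plain Java collections; the integer IDs stand in for whatever object IDs your library hands you, and borderOf is a hypothetical helper, not part of ELKI:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class BorderPoints {
    /** Border points = cluster members that are not core points (member order is kept). */
    static <T> Set<T> borderOf(Collection<T> members, Collection<T> corePoints) {
        Set<T> border = new LinkedHashSet<>(members);
        border.removeAll(new HashSet<>(corePoints));
        return border;
    }

    public static void main(String[] args) {
        // Hypothetical IDs: five cluster members, three of them core.
        Set<Integer> border = borderOf(Arrays.asList(1, 2, 3, 4, 5), Arrays.asList(2, 3, 4));
        System.out.println(border);
    }
}
```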
Note that the distinction between border points, noise points and core points is considered obsolete and not well supported by theoretical models in newer literature.