I have a seemingly trivial question. I need someone to clarify for me the meaning of the DBSCAN MinPts parameter in the ELKI implementation.
If I use a value of k = 4 to plot a sorted k-dist graph, the k-dist of a point p is the distance to its 4th nearest neighbor. This means that the neighborhood contains 5 points (k + 1): the 4 neighbors plus point p itself.
In ELKI, does MinPts count the neighbors only, or does it also include the point p itself? In the above case, should it be set to 4 or 5?
The original DBSCAN paper (Ester et al. 1996) talks of setting MinPts to k (MinPts = 4). The DBSCAN Wikipedia article also seems to imply that MinPts refers to the neighbors around p. However, ELKI seems to expect that MinPts is set to k + 1 (MinPts = 5).
Could someone kindly clarify?
If you are in a database context, and you send a query to the database
Select all objects within a radius of r around coordinates x,y,z
then the database will include the query point in the result if it is stored in the database. In particular, you could easily remove it from the result if you did not want it included. From a database point of view, a query should include the query point if it is in the database, and not if it is not stored in the database.
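To make that database view concrete, here is a rough sketch in plain Java (not ELKI code; the Point record and rangeQuery method are just illustrative): a range query simply returns every stored object within the radius, so the query point comes back whenever it is itself stored in the database.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only, not ELKI API: a range query over an in-memory "database".
// If the query point is stored, its distance to itself is 0 <= r,
// so it is returned like any other object.
public class RangeQueryDemo {
    record Point(String id, double x, double y) {}

    static double distance(Point a, Point b) {
        return Math.hypot(a.x() - b.x(), a.y() - b.y());
    }

    static List<Point> rangeQuery(List<Point> db, Point query, double r) {
        List<Point> result = new ArrayList<>();
        for (Point p : db) {
            if (distance(p, query) <= r) {
                result.add(p); // includes the query point when it is part of db
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Point> db = List.of(
                new Point("a", 0, 0),
                new Point("b", 1, 0),
                new Point("c", 5, 5));
        // Query with a point that is stored in the database:
        System.out.println(rangeQuery(db, db.get(0), 2.0));
        // prints "a" and "b": the query point "a" is part of the answer
    }
}
```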
Even more: if you do density estimation, then every data point should contribute to the density, shouldn't it? Why would one point be special? What about other points with exactly the same coordinates? What if you estimate the density at a point that is not in the database? You would see an abrupt density increase the moment you move a tiny bit away from your query point!
If you define the k-nearest-neighbor search as a query against a database D, and do not require the query point x to be part of the database, then it follows naturally that the result should include the query point whenever it happens to be part of D: it is just another database object then.
On the other hand, it is counterintuitive that the 1-nearest neighbor then usually is the query point itself. When you look for "the nearest neighbor", you usually mean "the nearest other object", unfortunately, even though that formally translates to "the nearest object to the coordinates of my query point, in my database, excluding my query point itself".
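The same kind of sketch for a kNN query (again plain Java, not ELKI; names are illustrative): if the query point is stored in the database, it is its own 1-nearest neighbor at distance 0, and the "nearest other object" only shows up as the 2nd-nearest neighbor.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Sketch only, not ELKI API: a kNN query that does not treat the query point specially.
public class KnnQueryDemo {
    record Point(String id, double x, double y) {}

    static double distance(Point a, Point b) {
        return Math.hypot(a.x() - b.x(), a.y() - b.y());
    }

    static List<Point> kNearest(List<Point> db, Point query, int k) {
        return db.stream()
                .sorted(Comparator.comparingDouble((Point p) -> distance(p, query)))
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Point> db = List.of(
                new Point("a", 0, 0),
                new Point("b", 1, 0),
                new Point("c", 2, 0));
        // Query with a stored point: the 1-NN is the query point itself.
        System.out.println(kNearest(db, db.get(0), 2)); // "a" (distance 0), then "b"
    }
}
```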
Unfortunately, this is not used consistently in the literature. Some articles, authors, and applications include the query point, and some do not. I can name plenty of examples from the literature for both cases.
Even a single article sometimes includes the query point in one figure, but not in another!
There will never be a solution that behaves according to the expectations of everybody, because people do have different ideas of what is "correct", unfortunately.
You will have to decide what you want the behaviour to be, and double-check that everything behaves the way you expect. Document your decisions and observations.
Please check for yourself whether the implementation of the k-distance graph in ELKI includes the query point. We may even have changed the behavior of this class for version 0.7 or 0.8, so it may be different for me than it is for you. Really, really look at the source of the exact version that you are using.
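One way to do that check is to compute the sorted k-dist graph yourself on a small sample, once with and once without the query point, and compare against what ELKI plots. A sketch (plain Java, not the ELKI implementation; the includeSelf flag is exactly the convention in question):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only, not ELKI code: the sorted k-dist graph, computed either
// including or excluding the query point itself.
public class KDistGraph {
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(sum);
    }

    static double[] sortedKDist(double[][] data, int k, boolean includeSelf) {
        int n = data.length;
        double[] kdist = new double[n];
        for (int i = 0; i < n; i++) {
            List<Double> dists = new ArrayList<>();
            for (int j = 0; j < n; j++) {
                if (!includeSelf && j == i) continue; // skip the query point itself
                dists.add(euclidean(data[i], data[j]));
            }
            Collections.sort(dists);
            kdist[i] = dists.get(k - 1); // distance to the k-th entry of the neighborhood
        }
        Arrays.sort(kdist); // sort for plotting
        return kdist;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {1, 0}, {2, 0}, {10, 10}};
        // If ELKI's curve matches the first variant, its k-dist counts the query point.
        System.out.println(Arrays.toString(sortedKDist(data, 3, true)));
        System.out.println(Arrays.toString(sortedKDist(data, 3, false)));
    }
}
```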
If the k-distance graph does not include the query point, you would need to use the 3-distance for minPts=4. If it does include the query point, then the 4-distance agrees with minPts=4. I'm pretty sure that DBSCAN does count the query point, for the above reasons (the database point of view and the density-estimation point of view). Thus for DBSCAN, minPts=1 is nonsense (every point is a core point), and minPts=2 is single-linkage clustering (any epsilon-neighbors are merged). Only with minPts > 2 do you start getting real DBSCAN results.
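For illustration, here is the core-point test under the convention that the epsilon-neighborhood counts the query point itself (a sketch, not ELKI code): with minPts=1 the point alone already satisfies the condition, and with minPts=2 a single epsilon-neighbor suffices, which is why any two epsilon-neighbors get connected as in single-linkage.

```java
// Sketch only, not ELKI code: the DBSCAN core-point condition when the
// epsilon-neighborhood includes the query point itself.
public class CorePointDemo {
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(sum);
    }

    static boolean isCore(double[][] data, int i, double eps, int minPts) {
        int count = 0;
        for (int j = 0; j < data.length; j++) {
            if (euclidean(data[i], data[j]) <= eps) count++; // j == i counts: distance 0 <= eps
        }
        return count >= minPts;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0.5, 0}, {10, 10}};
        // minPts=1: even the isolated point at (10, 10) is a core point.
        System.out.println(isCore(data, 2, 1.0, 1)); // true
        // minPts=2: one epsilon-neighbor besides the point itself is enough.
        System.out.println(isCore(data, 0, 1.0, 2)); // true
        System.out.println(isCore(data, 2, 1.0, 2)); // false
    }
}
```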
GDBSCAN suggests using 2*dim-1 instead of 4; I usually start with minPts=10, then try 20. There are several reasons to choose a larger minPts; one is the dimensionality of your data: the higher the dimensionality, the larger minPts should be (but for textual data, dimensionality is meaningless - at most choose by the intrinsic dimensionality).

But don't overshoot. Index efficiency drops substantially with large query radii. You want to choose minPts as small as you can while still getting an interesting result. Also do use multiple values, to get different views.
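As a sketch of that exploration loop (the runDbscan call is hypothetical; it stands in for however you actually invoke DBSCAN, e.g. the ELKI command line or Java API):

```java
// Sketch only: sweep over a few candidate minPts values and inspect each result.
public class MinPtsSweep {
    public static void main(String[] args) {
        int dim = 10;                      // dimensionality of your data (example value)
        int gdbscanRule = 2 * dim - 1;     // the GDBSCAN rule of thumb
        int[] candidates = {4, 10, gdbscanRule, 20};

        for (int minPts : candidates) {
            double eps = 0.5;              // in practice, re-read eps from the k-dist graph for this minPts
            // runDbscan(data, eps, minPts); // hypothetical stand-in for your actual DBSCAN call
            System.out.println("try DBSCAN with minPts=" + minPts + " and eps=" + eps);
        }
    }
}
```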
Remember that clustering is exploratory data mining. It is meant to make you experiment with parameters, study the result, and repeat, because there is no single correct clustering result. The quality of a clustering result is whether it gives you a new insight into your data. A clustering that only reproduces a known result has in fact failed.