DBSCAN(D, eps, MinPts)
C = 0
for each unvisited point P in dataset D
mark P as visited
NeighborPts = regionQuery(P, eps)
if sizeof(NeighborPts) < MinPts
mark P as NOISE
else
C = next cluster
expandCluster(P, NeighborPts, C, eps, MinPts)
expandCluster(P, NeighborPts, C, eps, MinPts)
add P to cluster C
for each point P' in NeighborPts
if P' is not visited
mark P' as visited
NeighborPts' = regionQuery(P', eps)
if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with NeighborPts'
if P' is not yet member of any cluster
add P' to cluster C
regionQuery(P, eps)
return all points within P's eps-neighborhood
Above is. as you can see, the algorithm of DBSCAN according to Wikipedia.
I want to ask about this exact part.
NeighborPts = NeighborPts joined with NeighborPts'
My understanding was that if a core point from a neighbor of core point is visited, it will be joined to the currently examined cluster, right? But how does the recursion happen here? Because we has defined the loop of:
for each point P' in NeighborPts
before the process of the joining, so any additional point from NeighborPts' won't be examined by the expandCluster function and if the new NeighborPts' actually has a point that is an another core point to the same cluster, how does the algorithm proceed?
I have a code with the implementation of 'expandCluster' method in Java:
public void expand(Vector<Integer> region, Group c, double dist, int minPts){
for(int i = 0; i < region.size(); i++){
int idx = region.get(i);
if(labels[idx] == 0){ // check if point is visited
labels[idx] = 1; // mark as visited
Vector<Integer> v = region(idx, dist); // check for neighboring point
if (v.size() >= minPts){ // check if core point
region.addAll(v); // join the NeighborPts
}
}
if(clustered[idx] == 0){
c.elements.add(patterns.get(idx));
clustered[idx] = clusters.size()+1;
}
}
}
Will the data collection region
going to be revisited after the modification of the data collection through this code region.addAll(v);
?
My understanding was that if a core point from a neighbor of core point is visited, it will be joined to the currently examined cluster, right?
Yes you're right and you can safely remove the line
if P' is not visited
However, this is not efficient.
If a point P' has been visited there is no need to compute its neighborhood and join it with the neighborhood of P.
It is visited means that: it's a noise point, it is already in a cluster or it's a border point. If it's already in a cluster and if it's a core point, this means that its neighbors have been processed already. If it's a border point then its neighbors must not be joined.
But how does the recursion happen here?
In the line
for each point P' in NeighborPts
you must consider NeighborPts
as a dynamic container of points. The first time you enter the for loop NeighborPts
contains X
points. If the joining adds Y
points to NeighborPts
then the for loop will visit both X
and Y
sets. This will then repeat for the sets X
and Y
and this is how the recursion happens.
Will the data collection region going to be revisited after the modification of the data collection through this code region.addAll(v);?
Yes, every time you call region.addAll(v)
, region.size()
increases and this confirms the recursion behavior that confuses you.