I'm working on a comparison between clustering algorithms, and I want to know how HDBSCAN in R calculates the so-called membership 'probability'.
In the dbscan package, the hdbscan() function does some validity checking of the object passed as input, and then calculates the distance from each point to its k nearest neighbors using the dbscan::kNNdist() function. The value of k is set to the minPts argument passed to hdbscan(), less 1.
core_dist <- kNNdist(x, k = minPts - 1)
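To make the core distance concrete, here is a minimal base R sketch of what that call computes. It assumes kNNdist() returns one value per point (the distance to the k-th nearest neighbor, not counting the point itself), which matches how core_dist is indexed as a vector in the loop further down. The toy data and the name core_dist_sketch are made up for illustration.

```r
# Toy data: five points on a line (hypothetical values)
x <- matrix(c(0, 1, 2, 4, 8), ncol = 1)
minPts <- 3
k <- minPts - 1

# Full pairwise distance matrix; each row contains the point's own distance of 0
d <- as.matrix(dist(x))

# Sort each row and take element k + 1 to skip the zero self-distance
core_dist_sketch <- apply(d, 1, function(row) sort(row)[k + 1])
unname(core_dist_sketch)
# 2 1 2 3 6
```

The isolated point at 8 has the largest core distance, so it sits in the least dense region. Under the assumption above, kNNdist(x, k = minPts - 1) should give the same values, but that is worth checking against your installed version of the package.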
It then uses the core distance as the measure of density and calculates membership probabilities with the following code (from the hdbscan.R source):
## Generate membership 'probabilities' using core distance as the measure of density
prob <- rep(0, length(cl))
for (cid in sl) {
  ccl <- res[[as.character(cid)]]
  max_f <- max(core_dist[which(cl == cid)])
  pr <- (max_f - core_dist[which(cl == cid)]) / max_f
  prob[cl == cid] <- pr
}
For each cluster id in the salient clusters object sl, the algorithm finds the maximum core distance within that cluster, subtracts each point's core distance from that maximum, and divides the result by the maximum to convert it to a proportion. The point with the largest core distance in a cluster therefore gets 0, and points in denser regions (smaller core distance) get values closer to 1.
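A small worked example may help. The sketch below reruns the same arithmetic in base R on made-up inputs: the core distances, the labels in cl, and the contents of sl are all hypothetical, and 0 is assumed to be the noise label.

```r
# Hypothetical core distances and cluster labels (0 = noise, by assumption)
core_dist <- c(0.5, 1.0, 2.0, 0.2, 0.4, 3.0)
cl        <- c(1,   1,   1,   2,   2,   0)
sl        <- c(1, 2)   # cluster ids to score; noise is not included

prob <- rep(0, length(cl))
for (cid in sl) {
  in_cluster <- cl == cid
  max_f <- max(core_dist[in_cluster])   # least dense point in this cluster
  prob[in_cluster] <- (max_f - core_dist[in_cluster]) / max_f
}
prob
# 0.75 0.50 0.00 0.50 0.00 0.00
```

Two things stand out. The value is relative to each cluster's own maximum, so 0.5 in cluster 1 and 0.5 in cluster 2 correspond to very different core distances. And the noise point keeps the initial value of 0 because the loop never touches it.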
These membership probabilities are then inserted into the list returned by the hdbscan() function as the membership_prob element.
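To look at the values for your own data, something along these lines should work. It is only a sketch: it assumes the dbscan package is installed and uses the built-in iris measurements purely as convenient sample data, with minPts = 5 chosen arbitrarily.

```r
library(dbscan)

# Built-in data, used here only as a convenient example
x <- as.matrix(iris[, 1:4])
res <- hdbscan(x, minPts = 5)

# One value per observation, shown next to the assigned cluster
out <- data.frame(cluster = res$cluster, prob = res$membership_prob)
head(out)
```

Given the loop above, each value should fall between 0 and 1, and any observation that is not assigned to one of the salient clusters should show 0.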