I have a dataset of 6 elements. I computed the distance matrix using Gower distance, which resulted in the following matrix:
Just by looking at this matrix, I can tell that element #0 is most similar to elements #4 and #5, so I assumed HDBSCAN would cluster those together and treat the rest as outliers; however, that wasn't the case.
clusterer = hdbscan.HDBSCAN(min_cluster_size=2, min_samples=3, metric='precomputed', cluster_selection_epsilon=0.1, cluster_selection_method='eom').fit(distance_matrix)
Clusters Formed:
Cluster 0: {element #0, element #2}
Cluster 1: {element #4, element #5}
Outliers: {element #1, element #3}
This is behavior I don't understand. Also, neither cluster_selection_epsilon nor cluster_selection_method seems to have any effect on my results, and I don't understand why.
I then changed the parameters to min_cluster_size=2, min_samples=1 and got:
Clusters Formed:
Cluster 0: {element #0, element #2, element #4, element #5}
Cluster 1: {element #1, element #3}
Any other change to the parameters resulted in all points being classified as outliers.
Can someone please help explain this behavior, and explain why cluster_selection_epsilon and cluster_selection_method don't affect the clusters formed? I thought that by setting cluster_selection_epsilon to 0.1, I'd be ensuring that the points inside a cluster would be at most 0.1 apart (so that element #0 and element #2 aren't clustered together, for instance).
As touched upon in the help page, the core of hdbscan is 1) calculating the mutual reachability distance and 2) applying the single linkage algorithm. Since you do not have that many data points and your distance matrix is precomputed, you can see that your clustering is decided by the single linkage:
import numpy as np
import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns
x = np.array([[0.0, 0.741, 0.344, 1.0, 0.062, 0.084],
              [0.741, 0.0, 0.648, 0.592, 0.678, 0.657],
              [0.344, 0.648, 0.0, 0.648, 0.282, 0.261],
              [1.0, 0.592, 0.655, 0.0, 0.937, 0.916],
              [0.062, 0.678, 0.282, 0.937, 0.0, 0.107],
              [0.084, 0.65, 0.261, 0.916, 0.107, 0.0]])
clusterer = hdbscan.HDBSCAN(min_cluster_size=2, min_samples=1,
                            metric='precomputed').fit(x)
clusterer.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
The results will be:
clusterer.labels_
[0 1 0 1 0 0]
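This is because, by default, HDBSCAN will not return a single cluster (allow_single_cluster=False), so it has to form at least two. With min_cluster_size=2, the only way to achieve this is to keep elements 0, 2, 4 and 5 together and put elements 1 and 3 in the other cluster.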
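To make the two steps named above more concrete, here is a rough NumPy-only sketch of mutual reachability followed by single linkage. It is not hdbscan's own code: the helper names are made up for illustration, and it assumes the core distance is the distance to the min_samples-th nearest neighbour, counting the point itself as neighbour 0.

```python
import numpy as np

x = np.array([[0.0, 0.741, 0.344, 1.0, 0.062, 0.084],
              [0.741, 0.0, 0.648, 0.592, 0.678, 0.657],
              [0.344, 0.648, 0.0, 0.648, 0.282, 0.261],
              [1.0, 0.592, 0.655, 0.0, 0.937, 0.916],
              [0.062, 0.678, 0.282, 0.937, 0.0, 0.107],
              [0.084, 0.65, 0.261, 0.916, 0.107, 0.0]])


def mutual_reachability(dist, min_samples):
    """Hypothetical helper: max(core_i, core_j, d_ij) for every pair."""
    # Assumption: core distance = distance to the min_samples-th nearest
    # neighbour, where the point itself is neighbour 0.
    k = min(min_samples, dist.shape[0] - 1)
    core = np.sort(dist, axis=1)[:, k]
    mr = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mr, 0.0)
    return mr


def single_linkage_heights(dist):
    """Hypothetical helper: sorted minimum-spanning-tree edge weights,
    i.e. the heights at which single linkage merges clusters (Prim)."""
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()          # cheapest known link into the tree
    heights = []
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best)
        j = int(np.argmin(candidates))
        heights.append(float(candidates[j]))
        in_tree[j] = True
        best = np.minimum(best, dist[j])
    return sorted(heights)


mr1 = mutual_reachability(x, min_samples=1)
merges_1 = single_linkage_heights(mr1)
print(merges_1)   # [0.062, 0.084, 0.261, 0.592, 0.648]

mr3 = mutual_reachability(x, min_samples=3)
merges_3 = single_linkage_heights(mr3)
print(merges_3)   # [0.282, 0.344, 0.344, 0.657, 0.916]
```

Under these assumptions, min_samples=1 leaves the distances unchanged, and the final merge at 0.648 is the one that joins {1, 3} to {0, 2, 4, 5}. With min_samples=3 the small distances 0.062 and 0.084 are lifted to the core distances, which is why changing min_samples changes the tree.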
One quick solution is to simply cut the tree and get the cluster you intended:
clusterer.single_linkage_tree_.get_clusters(0.15, min_cluster_size=2)
[ 0 -1 -1 -1 0 0]
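If it helps to see what such a cut amounts to, the sketch below reproduces it without hdbscan: for single linkage, cutting at a height is the same as grouping points connected through links no longer than that height. The function name cut_single_linkage is made up for this example, and it works on the raw distances, which I am assuming matches the min_samples=1 case above.

```python
import numpy as np

x = np.array([[0.0, 0.741, 0.344, 1.0, 0.062, 0.084],
              [0.741, 0.0, 0.648, 0.592, 0.678, 0.657],
              [0.344, 0.648, 0.0, 0.648, 0.282, 0.261],
              [1.0, 0.592, 0.655, 0.0, 0.937, 0.916],
              [0.062, 0.678, 0.282, 0.937, 0.0, 0.107],
              [0.084, 0.65, 0.261, 0.916, 0.107, 0.0]])


def cut_single_linkage(dist, cut, min_cluster_size):
    """Hypothetical helper: label connected components of the graph whose
    edges are the distances <= cut; small components become noise (-1)."""
    n = dist.shape[0]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    next_label = 0
    for start in range(n):
        if visited[start]:
            continue
        # Flood-fill everything reachable through links of length <= cut.
        component, stack = [], [start]
        visited[start] = True
        while stack:
            i = stack.pop()
            component.append(i)
            for j in np.flatnonzero((dist[i] <= cut) & ~visited):
                visited[j] = True
                stack.append(int(j))
        if len(component) >= min_cluster_size:
            labels[component] = next_label
            next_label += 1
    return labels


labels_015 = cut_single_linkage(x, 0.15, min_cluster_size=2)
print(labels_015)   # labels 0, -1, -1, -1, 0, 0

labels_030 = cut_single_linkage(x, 0.30, min_cluster_size=2)
print(labels_030)   # labels 0, -1, 0, -1, 0, 0
```

Raising the cut to 0.30 pulls element 2 into the cluster, because it links to element 5 at 0.261.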
Alternatively, you can simply use sklearn.cluster.AgglomerativeClustering, since you are not relying on hdbscan to calculate the distances.
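A possible version of the scikit-learn route is sketched below. The choices of single linkage and a distance_threshold of 0.15 are mine, picked to mirror the tree cut above. The keyword for a precomputed matrix is metric in recent scikit-learn releases and affinity in older ones, so the sketch tries both. Unlike hdbscan, scikit-learn gives every singleton its own label, so the last two lines relabel them as -1.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

x = np.array([[0.0, 0.741, 0.344, 1.0, 0.062, 0.084],
              [0.741, 0.0, 0.648, 0.592, 0.678, 0.657],
              [0.344, 0.648, 0.0, 0.648, 0.282, 0.261],
              [1.0, 0.592, 0.655, 0.0, 0.937, 0.916],
              [0.062, 0.678, 0.282, 0.937, 0.0, 0.107],
              [0.084, 0.65, 0.261, 0.916, 0.107, 0.0]])

# n_clusters must be None when distance_threshold is used.
params = dict(n_clusters=None, linkage="single", distance_threshold=0.15)
try:
    model = AgglomerativeClustering(metric="precomputed", **params)
except TypeError:
    # Older scikit-learn releases call this argument `affinity`.
    model = AgglomerativeClustering(affinity="precomputed", **params)

labels = model.fit_predict(x)

# Mark clusters with fewer than 2 members as noise (-1), hdbscan-style.
sizes = np.bincount(labels)
labels = np.where(sizes[labels] >= 2, labels, -1)
print(labels)
```

Elements 0, 4 and 5 should end up sharing one label, with the remaining elements marked as -1.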