HDBSCAN difference between parameters


I'm confused about the difference between the following parameters in HDBSCAN:

  1. min_cluster_size
  2. min_samples
  3. cluster_selection_epsilon

Correct me if I'm wrong.

For min_samples, if it is set to 7, then the clusters formed need to have 7 or more points. For cluster_selection_epsilon, if it is set to 0.5 meters, then any clusters that are more than 0.5 meters apart will not be merged into one, meaning that each cluster will only include points that are 0.5 meters apart or less.

How is that different from min_cluster_size?


Solution

  • They technically do two different things.

    min_samples = the number of neighbours a point needs in order to count as a core point. The higher this is, the more points are going to be discarded as noise/outliers. This comes from the DBSCAN part of HDBSCAN.

    min_cluster_size = the minimum size a final cluster can be. The higher this is, the bigger your clusters will be. This comes from the H (hierarchical) part of HDBSCAN.

    Increasing min_samples tends to leave you with larger, more general clusters, but it gets there by being stricter about density: more points are discarded as outliers by the DBSCAN-style step.

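    Increasing min_cluster_size while keeping min_samples small, by comparison, keeps most of those points. Instead, any group smaller than min_cluster_size is not allowed to stand as a cluster of its own, so small groups are absorbed into a larger neighbouring cluster (or dropped as noise) until every remaining cluster has at least min_cluster_size points.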
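To make the min_samples point above concrete, here is a minimal sketch of the "core distance" idea: the distance from each point to its min_samples-th nearest neighbour. The helper name `core_distances` and the random toy data are made up for illustration, and counting the point itself as the first neighbour is an assumption about the convention (implementations may differ by one).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data, assumed purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

def core_distances(X, min_samples):
    """Distance from each point to its min_samples-th nearest neighbour.

    Hypothetical helper: querying the fitted data returns each point as
    its own nearest neighbour (distance 0), so the point counts itself.
    """
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nn.kneighbors(X)
    return distances[:, -1]  # last column = farthest of the k neighbours

core_small = core_distances(X, 5)
core_large = core_distances(X, 15)

# A larger min_samples can only push each point's core distance up,
# so sparse points look even more isolated and are more likely to be noise.
print("mean core distance, min_samples=5: ", core_small.mean())
print("mean core distance, min_samples=15:", core_large.mean())
```

Since a point's 15th neighbour is never closer than its 5th, raising min_samples never shrinks a core distance, which is why it pushes more borderline points out as outliers.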

    So:

    1. If you want many highly specific clusters, use a small min_samples and a small min_cluster_size.
    2. If you want more generalized clusters but still want to keep most of the detail, use a small min_samples and a large min_cluster_size.
    3. If you want very general clusters and are happy to discard a lot of points as noise, use a large min_samples and a large min_cluster_size.

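    (If you leave min_samples unset, it defaults to the value of min_cluster_size. Afaik the library does not actually stop you from setting min_samples larger than min_cluster_size.)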