Search code examples
machine-learninghierarchical-clustering

Determining groups in a hierarchical cluster


I have an algorithm that can group data into a hierarchical cluster tree. The algorithm is the one described in Toby Seagram's Programming Collective Intelligence. The tree output is a binary tree with a "distance" value at each node, that tells you how far apart the two child nodes are.

I can then display this as a Dendrogram and it makes it fairly easy for a human spot which values are grouped together. However I'm having difficult coming up with an algorithm that automatically decides what the groups should be. I'd like to be able to determine automatically:

  • The number of group
  • Which points should be placed in each group

Is there a standard algorithm for this?


Solution

  • I think there is no default way to do this. Simple 'manual' methods would be to either:

    • specify the number of clusters you want/expect
    • set a threshold for the maximum distance between two nodes; any nodes with a larger distance belong to another cluster

    There are some automatic methods to determine the number of clusters. R has the Dynamic Tree Cut package which automatically deals with this problem, also pvclust could be used. Here are two more methods described to deal with this problem, Salvador (2002) and Daniels (2006).