A dendrogram is a tree data structure used with hierarchical clustering algorithms; it groups clusters at different "heights" of the tree, where the heights correspond to distance measures between clusters.
After a dendrogram is created from some input data set, it's often a further challenge to figure out where to "cut" the dendrogram: selecting a height such that only the clusters formed below that height are considered meaningful.
It's not always clear at what height to cut a dendrogram, but some algorithms exist, such as the DynamicTreeCut algorithm, that attempt to programmatically select meaningful clusters from the dendrogram.
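For a concrete sense of what "cutting at a height" means in practice, here is a minimal sketch using SciPy (the toy data and the threshold of 2.0 are made up for illustration):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy data: two loose blobs in 2-D (made up for illustration).
    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(0, 0.5, (5, 2)),
                        rng.normal(5, 0.5, (5, 2))])

    # Build the dendrogram (linkage matrix) with average linkage.
    Z = linkage(points, method="average")

    # "Cut" the dendrogram at height 2.0: merges below that height stay
    # inside a cluster, merges above it are ignored.
    labels = fcluster(Z, t=2.0, criterion="distance")
    print(labels)  # e.g. [1 1 1 1 1 2 2 2 2 2]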
See:
- https://stats.stackexchange.com/questions/3685/where-to-cut-a-dendrogram
- Cutting dendrogram at highest level of purity
So I've been reading over the DynamicTreeCut algorithm, as well as a Java implementation of it. I understand how the algorithm works, in terms of a step-by-step breakdown of what's going on. However, I'm failing to understand how this algorithm is doing anything meaningful. I think I'm missing some key concept here.
In general, the algorithm iterates over a "height sequence" from a dendrogram. I'm not certain, but I assume that a "height sequence" just means the values along the Y-axis of a dendrogram, i.e. the various heights at which cluster merges occur. If that's the case, we can assume that a "height sequence" is sorted in ascending order.
Then the algorithm calls for taking a "reference height" l and subtracting it from each height in the input "height sequence". This gives you a vector of differences D, where D[i] = H[i] - l for each height H[i] in the height sequence.
Then the algorithm attempts to find "transition points": indices i in the differences vector where D[i] > 0 and D[i+1] < 0. In other words, points in the differences vector where the difference value transitions from positive to negative.
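To make sure I'm reading this right, here is a minimal Python sketch of that step as I understand it (the names H, l, and D just mirror the notation above; this is my own code, not the paper's):

    def transition_points(H, l):
        """Return indices i where D[i] > 0 and D[i+1] < 0, with D = H - l."""
        D = [h - l for h in H]
        return [i for i in range(len(D) - 1) if D[i] > 0 and D[i + 1] < 0]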
And right here I am totally lost. I don't understand how these transition points can ever be meaningful. It's my understanding that the input height sequence H is just the values on the Y-axis of the dendrogram, so H should be in ascending order. How, then, can there ever be a point in the differences vector where we transition from positive to negative?
For example:
Suppose our input height sequence is H = {1, 1.5, 2, 2.5, 3, 7, 9} and our reference value l is the average height (roughly 3.71). If we create a difference vector D by subtracting l from each height in H, we get approximately {-2.7, -2.2, -1.7, -1.2, -0.7, 3.3, 5.3}. So clearly there is no transition point here, nor could there ever be: there is no point in the difference vector where D[i] > 0 and D[i+1] < 0, since the height sequence H is in ascending order.
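Running the transition_points sketch from above on this example confirms it:

    H = [1, 1.5, 2, 2.5, 3, 7, 9]
    l = sum(H) / len(H)              # ~3.71, the average height
    print(transition_points(H, l))   # [] -- no positive-to-negative transition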
So clearly I am totally misunderstanding something fundamental about this algorithm. Perhaps I am not understanding what is meant by the "height sequence". I am assuming it is simply the values on the Y-axis from the dendrogram, but clearly that doesn't make any sense with regard to what the algorithm is actually doing. Unfortunately, the authors don't really explain what is meant by a "dendrogram height sequence", nor does it appear to be some kind of standard terminology in use by the data science community.
So can someone explain what the DynamicTreeCut algorithm is trying to achieve here, and where my understanding is going wrong?
I fully agree with your understanding of the paper and the algorithm. I reached the same conclusion as you did.
What I am offering is speculation, but I feel pretty confident about it.
The immediate conclusion is that you cannot take H to be the sorted list of merge heights. Instead, it should be the sequence of heights read off when you reorder the points for a visualization, i.e. "such that the lines in the resulting plot do not cross".
In the example, H would become (0.70, 0.35, 0.90, 0.45, 0.77, 0.40, 0.30, 0.55, 0.05). To clarify, the first entry 0.70 is the height of the merging line between the 1st and the 2nd point (called 3 and 7 here). Note that the visualization is not unique, but the result of the algorithm in the end will be.
Conclusion: because the merging heights inside a cluster are low, and you want a cluster's H - l values to be positive, you want a big pack of low merging heights (i.e. a cluster) to stand above the zero line. So instead of using the merging heights directly, use their negatives.
In the example, H would become (-0.70, -0.35, -0.90, ...).
Let's try my assumption and take l = -0.75.
Then H - l becomes (0.05, 0.40, -0.15, 0.30, -0.02, 0.35, 0.45, 0.20, 0.70), and you have two transitions that define 3 clusters: (3-7-6), (1-4), and (9-5-8-10-2). See the picture for reference. That feels really reasonable.
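To make that concrete, here is the same computation as a small Python snippet (reusing the transition_points sketch from the question; the negation convention is my guess, not something the paper states explicitly):

    # Merge heights read left to right off a non-crossing plot, negated so
    # that the low merges inside a cluster sit above the zero line.
    H = [-0.70, -0.35, -0.90, -0.45, -0.77, -0.40, -0.30, -0.55, -0.05]
    l = -0.75
    D = [h - l for h in H]
    # D ~= [0.05, 0.40, -0.15, 0.30, -0.02, 0.35, 0.45, 0.20, 0.70]
    print(transition_points(H, l))  # [1, 3]
    # The negative gaps D[2] and D[4] act as the cuts, giving the three
    # clusters (3-7-6), (1-4) and (9-5-8-10-2).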
What I can also conclude is that this was a really roundabout way of describing a fixed-height branch cut. Note that this is only TreeCutCore, so all the dynamic part is still left to be done. Honestly, it's not that complex once you realize they just make recursive calls to TreeCutCore with varying heights on smaller and smaller clusters.
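Purely as a sketch of the shape I think that recursion takes (my own reconstruction, not the authors' code; in particular the per-branch choice of l below, the mean gap, is a stand-in for whatever rule the paper actually uses):

    def tree_cut_core(H, points, l):
        """Split `points` at every positive-to-negative transition of
        D = H - l; gap H[i] sits between points[i] and points[i+1]."""
        D = [h - l for h in H]
        cuts = [i for i in range(len(D) - 1) if D[i] > 0 and D[i + 1] < 0]
        clusters, start = [], 0
        for i in cuts:
            # the negative gap D[i+1] separates this cluster from the next
            clusters.append((H[start:i + 1], points[start:i + 2]))
            start = i + 2
        clusters.append((H[start:], points[start:]))
        return clusters

    def dynamic_cut(H, points):
        """Recurse: cut once, then re-run the core on each sub-cluster
        with a reference height recomputed from that cluster's own gaps."""
        if not H:
            return [points]
        l = sum(H) / len(H)  # stand-in reference height for this branch
        clusters = tree_cut_core(H, points, l)
        if len(clusters) == 1:  # no transition found: this branch is final
            return [points]
        return [c for sub_H, sub_pts in clusters
                for c in dynamic_cut(sub_H, sub_pts)]

Whether this matches the paper's exact output depends entirely on how l is chosen per branch, which is the part I'm least sure about.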
Also, as further reassurance that I'm not completely wrong: when you have several negative values one after the other, they correspond to singletons that are outliers, which is exactly what Node 54 (in Figure 5 of the paper you linked) is. Several contiguous negative values don't form a cluster themselves; they are singletons that are really dissimilar from each other. Only contiguous groups above 0 form a cluster.