R hclust height of final merge


When performing hierarchical clustering in R with the hclust function, how do you find the height of the final merge?

To clarify, using one of R's built-in datasets:

hc <- hclust(dist(USArrests))
dendrogram1 = as.dendrogram(hc)
plot(hc)

This results in a variable hc containing all the clustering information.

[Image: R clustering output]

And the dendrogram:

[Image: R dendrogram]

As you can see on the dendrogram, the final merge happens at a height above 200 (about 300). But how does the dendrogram know? This information does not appear to be in hc$height or in the dendrogram1 variable. The highest merge I can find listed there is at about 169.

[Image: variable dendrogram1]

If the dendrogram1 variable does not contain this information, how does the plot function know the merge must occur at a height of 300?

[Image: dendrogram R top merge]

I am asking because I need this number (approximately 300) for other applications, and reading it off the plot is impractical.

Thanks in advance to anyone willing to help!


Solution

  • These values can be calculated with stats::cophenetic():

    The cophenetic distance between two observations that have been clustered is defined to be the intergroup dissimilarity at which the two observations are first combined into a single cluster.

    This yields the following for your example:

    sort(unique(cophenetic(hc)))
    #  [1]   2.291   3.834   3.929   6.237   6.638   7.355   8.027   8.538  10.860
    # [10]  11.456  12.425  12.614  12.775  13.045  13.297  13.349  13.896  14.501
    # [19]  15.408  15.454  15.630  15.890  16.977  18.265  19.438  19.904  21.167
    # [28]  22.366  22.767  24.894  25.093  28.635  29.251  31.477  31.620  32.719
    # [37]  36.735  36.848  38.528  41.488  48.725  53.593  57.271  64.994  68.762
    # [46]  87.326 102.862 168.611 293.623
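The last value in that list, 293.623, is the height of the final merge. If only that single number is needed, it can be extracted directly. The snippet below is a minimal sketch using the same hc object as in the question; the variable names final_height, from_hclust and from_dendrogram are only illustrative. It also shows two other places where the same value should be available: the height component of the hclust object and the "height" attribute of the dendrogram's root node.

```r
hc <- hclust(dist(USArrests))

# The largest cophenetic distance is the level at which the last two
# clusters are joined, i.e. the height of the final merge.
final_height <- max(cophenetic(hc))

# hclust stores one height per merge (n - 1 values in total), so the
# maximum of that vector is the same number. With the default
# complete linkage the heights are non-decreasing, so this is also
# the last element of the vector.
from_hclust <- max(hc$height)

# The root node of the dendrogram carries its merge height as an attribute.
from_dendrogram <- attr(as.dendrogram(hc), "height")

print(c(final_height, from_hclust, from_dendrogram))
```

All three approaches should agree. If an environment viewer shows only the first few entries of hc$height, printing the full vector or calling max() on it avoids the truncation.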