Tags: r, tree, hierarchical-clustering, dendrogram, graph-visualization

Drawing a dendrogram knowing all merges beforehand in R


I'm looking to use R to draw a figure that looks like a hierarchical clustering tree (a dendrogram), except in my case, I already know which clusters merge with which.

Example: Assume we have objects 'a', 'b', 'c', 'd' and 'e'. They start out as 5 clusters: 1, 2, 3, 4 and 5. Now I want clusters 1 and 2 to merge into a new cluster (cluster 6), then clusters 3 and 5 to merge (cluster 7), then cluster 6 to merge with cluster 4 (cluster 8), and stop there. This tree would then be specified by a list [ (1,2), (3,5), (6,4) ].

Hopefully the description is clear enough. There are basically two subproblems to solve here:

  • Make a clustering object from an entirely supervised process.
  • Cut off the dendrogram before reaching the top.

If the latter is too much for one question, feel free to leave it out of your answer.


Solution

  • Here is an attempt at manually constructing an object of class "hclust".

    First - check what attributes this object should have:

    fit <- hclust(dist(USArrests))
    names(fit)
    # [1] "merge"   "height"  "order"   "labels"   "method"   "call"   "dist.method"
    

    Second - check what information has to be present in each of those:

    help(hclust)
    # read the section called "Value"
    

    Third - create an object and add the merge information:

    obj <- list()
    obj$merge <- rbind(c(-1, -2), c(-3, -5), c(-4, 1), c(2, 3))
    

    NOTE: according to the help page for hclust(), merge should be a two-column matrix with one row per step, specifying which objects are merged at that step. It seems that it has to include every step needed to merge all groups into one final tree, so you probably will not be able to stop half-way (in your example that would leave 2 separate trees). Negative values indicate leaves (e.g. c(-1, -2) means that observations 1 and 2 are merged at the first step). Positive values refer to clusters obtained in earlier steps (e.g. c(-4, 1) means that at this step observation 4 is merged with the cluster obtained at step 1).
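
    The same matrix can also be derived from the notation used in the question, where leaves are numbered 1 to n and the cluster created at step k is numbered n + k. The following is only a sketch under that assumption: `to_hclust_merge` is a hypothetical helper name, the pairs are written leaf-first (4, 6 rather than 6, 4) so that the result matches the matrix above, and the fourth pair (7, 8) is the extra top-level merge mentioned in the note, not part of the question's list.

    ```r
    # Hypothetical helper: translate merges given as cluster ids (question's
    # numbering) into the signed row indices that hclust's merge matrix uses.
    to_hclust_merge <- function(pairs, n) {
      m <- do.call(rbind, pairs)   # one row per merge, two columns
      # ids 1..n are leaves -> negative; ids above n are the cluster
      # created at step (id - n) -> positive row index
      ifelse(m <= n, -m, m - n)
    }

    # Question's merges plus the assumed final merge of clusters 7 and 8
    pairs <- list(c(1, 2), c(3, 5), c(4, 6), c(7, 8))
    merge_mat <- to_hclust_merge(pairs, n = 5)
    print(merge_mat)
    ```

    The result can be assigned to obj$merge in place of the hand-written rbind() call.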

    Fourth - add heights:

    obj$height <- 1:4
    

    NOTE: this holds the height at which each merge is drawn, one value per row of merge.

    Fifth - provide the order of observations:

    obj$order  <- c(1,2,4,3,5)
    

    NOTE: this is the order in which the observations are displayed along the x-axis. It is required so that the branches do not cross. You can provide an order that makes branches cross, but then the final dendrogram will not look pretty.
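
    For larger trees, writing the order by hand gets error-prone. One possible approach (a sketch, not the only way) is to walk the merge matrix from the last row down, so that the leaves of every cluster end up next to each other. `leaf_order` is a hypothetical helper name, and the function assumes the last row of merge is the root of a single complete tree.

    ```r
    # Hypothetical helper: derive a crossing-free leaf order from a merge matrix
    # by a depth-first walk starting at the final merge (the root).
    leaf_order <- function(merge) {
      walk <- function(i) {
        if (i < 0) return(-i)                      # negative entry: a single leaf
        c(walk(merge[i, 1]), walk(merge[i, 2]))    # positive entry: an earlier merge
      }
      walk(nrow(merge))
    }

    merge_mat <- rbind(c(-1, -2), c(-3, -5), c(-4, 1), c(2, 3))
    ord <- leaf_order(merge_mat)
    print(ord)   # 3 5 4 1 2
    ```

    This differs from the hand-written c(1,2,4,3,5), but it keeps each cluster's leaves contiguous in the same way; swapping the two columns of a row flips the corresponding branch left to right.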

    Sixth - add labels:

    obj$labels <- 1:5
    

    NOTE: these are the names of our leaves in the final tree.

    Seventh - bless our object with a class:

    class(obj) <- "hclust"
    

    NOTE: this is needed so that plot() dispatches to the method for "hclust" objects.

    Eighth - plot the result:

    plot(obj, hang=-1)
    rect(1, 2.5, 99, 99, col="white", border="white")
    


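    NOTE: hang=-1 puts all the leaves at the same level on the y-axis, and rect() draws a white rectangle that visually hides the upper part of the tree, which imitates your requirement not to join all the objects fully. The second argument of rect() (2.5 here) is the lower edge of the hidden area: with heights 1:4 it hides the merges at heights 3 and 4, while a value between 3 and 4 would keep the third merge visible and hide only the last one.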
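
    The white rectangle only changes the picture. To check which groups actually exist at the cut-off, cutree() can be applied to the same kind of object. The sketch below rebuilds the object so it runs on its own; using letters[1:5] as labels is an assumption taken from the object names in the question, and the cut height 3.5 assumes the heights 1:4 used above.

    ```r
    # Rebuild the hand-made hclust object (integer merge matrix, as hclust returns)
    obj <- list(
      merge  = rbind(c(-1L, -2L), c(-3L, -5L), c(-4L, 1L), c(2L, 3L)),
      height = c(1, 2, 3, 4),
      order  = c(1L, 2L, 4L, 3L, 5L),
      labels = letters[1:5]          # assumption: names 'a'..'e' from the question
    )
    class(obj) <- "hclust"

    # Cut just below the final merge (height 4): the groups left after three merges
    groups_h <- cutree(obj, h = 3.5)
    # Equivalent request by number of groups
    groups_k <- cutree(obj, k = 2)
    print(groups_h)
    ```

    With the merges above, a, b and d should land in one group and c and e in the other, which are clusters 8 and 7 in the question's numbering.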

    There might be a simpler/better way, but I do not know about it.