Is there an easy way to merge multiple hclust objects (or dendrograms) at the root?
I've made the example as complete as possible to illustrate my problem.
Let's say I'd like to cluster USArrests by region and then unite all the hclust objects to plot them together in a heatmap.
USArrests
Northeast <- c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island",
"Vermont", "New Jersey", "New York", "Pennsylvania")
Midwest <- c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin",
"Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota",
"South Dakota")
South <- c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina",
"South Carolina", "Virginia", "West Virginia",
"Alabama", "Kentucky", "Mississippi", "Tennessee", "Arkansas",
"Louisiana", "Oklahoma", "Texas")
West <- c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico",
"Utah", "Wyoming", "Alaska", "California", "Hawaii", "Oregon", "Washington")
h1 <- hclust(dist(USArrests[Northeast,]))
h2 <- hclust(dist(USArrests[Midwest,]))
h3 <- hclust(dist(USArrests[South,]))
h4 <- hclust(dist(USArrests[West,]))
Now I have 4 hclust objects (h1 through h4). I usually merge them like so:
hc <- as.hclust(merge(merge(merge(
as.dendrogram(h1), as.dendrogram(h2)), as.dendrogram(h3)),
as.dendrogram(h4)))
Then, to plot them, I have to reorder the matrix according to the hclust objects, then plot (I've added some annotation to make the plot clearer):
usarr <- USArrests[c(Northeast, Midwest, South, West),]
region_annotation <- data.frame(Region = c(rep("Northeast", length(Northeast)),
rep("Midwest", length(Midwest)),
rep("South", length(South)),
rep("West", length(West))),
row.names = c(Northeast, Midwest, South, West))
pheatmap(usarr, cluster_rows = hc,
annotation_row = region_annotation)
In summary: is there an easier way to do this than merging all the separate hclusts?
I ended up making a couple of functions to do this more automatically. (In my version, I've also added support for correlation "distance", so it's a little bigger)
hclust_semisupervised <- function(data, groups, dist_method = "euclidean",
dist_p = 2, hclust_method = "complete") {
hclist <- lapply(groups, function (group) {
hclust(dist(data[group,], method = dist_method, p = dist_p), method = hclust_method)
})
hc <- .merge_hclust(hclist)
data_reordered <- data[unlist(groups),]
return(list(data = data_reordered, hclust = hc))
}
.merge_hclust <- function(hclist) {
#-- Merge
d <- as.dendrogram(hclist[[1]])
for (i in 2:length(hclist)) {
d <- merge(d, as.dendrogram(hclist[[i]]))
}
as.hclust(d)
}
Having USArrests and the region vectors, I call hclust_semisupervised
like so:
semi_hc <- hclust_semisupervised(USArrests, list(Northeast, Midwest, South, West)
Now plotting the heatmap:
pheatmap(semi_hc$data, cluster_rows = semi_hc$hclust,
annotation_row = region_annotation)