cluster-analysis, dendrogram, dendextend, heatmaply

How to calculate the cophenetic similarity between two individuals in two dendrograms or between two clustering methods?


How can I calculate the cophenetic distance for an individual within two trees (not between two whole trees)?

I want to calculate the similarity/dissimilarity in position per individual within two dendrograms and show the result in the row color of a combined heatmap and dendrogram using R packages dendextend and heatmaply.


Solution

  • Thanks all for the help. Based on the links provided by vilisSO and the answer from Grant, I wrote the following code to calculate the correlation between the cophenetic distances in two trees, one built on the full data and one built on a subsample of the data. For each leaf in the dendrogram, the correlation is calculated between that leaf's vectors of cophenetic distances to all other leaves in the two trees:

    ## Compare cophenetic similarity between leaves in two trees built on the full data and on a subsample of the data
    
    # 1 ) Generate random data to build trees
    set.seed(2015-04-26)
    dat <- matrix(rnorm(10 * 50), 10, 50) # Matrix with 10 rows (the leaves) and 50 columns
    datSubSample <- dat[, sample(ncol(dat), 30)] # Matrix with 30 columns sampled from the 50 columns of dat
    dat_dist1 <- dist(datSubSample)
    dat_dist2 <- dist(dat)
    hc1 <- hclust(dat_dist1)
    hc2 <- hclust(dat_dist2)
    
    # 2) Build two dendrograms: the first based on a sample of the data (30 out of 50 columns), the second based on all data
    dendrogram1 <- as.dendrogram(hc1)
    dendrogram2 <- as.dendrogram(hc2)
    
    # 3) For each tree, get the cophenetic distance matrix;
    # each column holds the distances of one leaf to all other leaves in the same tree
    cophDistanceMatrix1 <- as.data.frame(as.matrix(cophenetic(dendrogram1)))
    cophDistanceMatrix2 <- as.data.frame(as.matrix(cophenetic(dendrogram2)))
    
    # 4) Calculate the correlation between the cophenetic distances of a leaf to all other leaves, between the two trees
    leaves <- rownames(as.matrix(dat_dist2)) # Leaf labels in the original row order of the data
    corPerLeave <- NULL # Vector to store the correlation for each leaf
    for (leave in leaves){
      # Index rows by label so both distance vectors are in the same leaf order
      leafCor <- cor(cophDistanceMatrix1[leaves, leave], cophDistanceMatrix2[leaves, leave])
      corPerLeave <- c(corPerLeave, leafCor)
    }
    
    # 5) Convert the cophenetic correlation to a color to show in the side bar of a heatmap
    corPerLeave <- corPerLeave/max(corPerLeave) # Rescale so the highest correlation equals 1
    byPal <- colorRampPalette(c('yellow','blue')) # Yellow-to-blue color palette, low correlation = yellow
    colCopheneticCor <- byPal(20)[as.numeric(cut(corPerLeave, breaks =20))]
    
    # 6) Plot a heatmap with dendrogram and a side bar that shows the cophenetic correlation for each leaf
    library(heatmaply)
    row_dend  <- dendrogram2
    x  <- as.matrix(dat_dist2)
    heatmaply(x, Rowv = row_dend, row_side_colors = colCopheneticCor)
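
For reuse, the per-leaf comparison above could be wrapped in two small helpers. This is only a sketch in base R, under a few assumptions: `cor_per_leaf` and `cor_to_colors` are hypothetical names (not functions from dendextend or heatmaply), both trees are `hclust` objects built on the same observations, and the colors are binned on a fixed [-1, 1] scale rather than on the observed range, which is a different choice from the rescaling in step 5.

```r
# Hypothetical helper: correlation, per leaf, between that leaf's cophenetic
# distances to all leaves in tree A and in tree B.
cor_per_leaf <- function(hc_a, hc_b) {
  coph_a <- as.matrix(cophenetic(hc_a))
  coph_b <- as.matrix(cophenetic(hc_b))
  leaves <- rownames(coph_a)
  # Put the second matrix in the same row/column order as the first
  coph_b <- coph_b[leaves, leaves]
  # One correlation per leaf; the self-distance of 0 is kept, as in the loop above
  vapply(leaves, function(leaf) cor(coph_a[, leaf], coph_b[, leaf]), numeric(1))
}

# Hypothetical helper: map correlations to colors on a fixed [-1, 1] scale,
# so the same correlation gets the same color in every plot.
cor_to_colors <- function(r, n = 20, low = "yellow", high = "blue") {
  pal <- colorRampPalette(c(low, high))(n)
  r <- pmin(pmax(r, -1), 1) # guard against tiny floating-point overshoot
  bins <- cut(r, breaks = seq(-1, 1, length.out = n + 1), include.lowest = TRUE)
  pal[as.integer(bins)]
}

# Example with random data: 10 observations, full data vs. 30 of 50 columns
set.seed(1)
m <- matrix(rnorm(10 * 50), nrow = 10,
            dimnames = list(paste0("obs", 1:10), NULL))
hc_full <- hclust(dist(m))
hc_sub  <- hclust(dist(m[, sample(ncol(m), 30)]))
per_leaf <- cor_per_leaf(hc_full, hc_sub)
leaf_colors <- cor_to_colors(per_leaf)
```

The result is a named vector in the row order of the data, so it lines up with the rows of the matrix passed to the heatmap regardless of how either dendrogram orders its leaves.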