
R: Compute the row-wise similarity for a dataframe and sort data out based on that similarity


I'm stuck on a row-wise similarity comparison. I have a table of clustering results from a few hundred runs.

The first column is the sample ID. For each run I asked k-means for 8 clusters, with 10% of the samples dropped out per run as a stability test.

Because each run is independent, cluster 1 in run_0 is not the same as cluster 1 in run_1; the numbers are assigned arbitrarily.

I want to compute the similarity between sample rows; that is, I want to know which samples stay together most of the time across these hundreds of runs.

I have seen this post and it is almost what I want

How to calculate the similarity for all the rows in a table in R?

However, I just did not quite get the function.

row_cf <- function(x, y, df){
  sum(df[x,] == df[y,])/ncol(df)
}

Could you explain this function a little more? I don't understand why sum(df[x,] == df[y,])/ncol(df) represents the similarity between rows x and y. Does it count how many columns of row x equal the same columns of row y, and then return that count as a proportion of all columns?

If so, then assigning all NAs a fixed value like 9 will increase the similarity, right?
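To make the function concrete, here is a minimal sketch on a made-up two-row table (the `toy` data frame and its values are invented for illustration, not taken from the real results). `df[x, ] == df[y, ]` compares the two rows column by column, `sum()` counts the `TRUE`s, and dividing by `ncol(df)` turns that count into a proportion. Labels only need to agree within a run, so the arbitrary numbering across runs does not matter here.

```r
# Hypothetical cluster labels for two samples over five runs; NA = sample dropped from that run
toy <- data.frame(
  run_1 = c(1, 1),
  run_2 = c(2, 3),
  run_3 = c(NA, NA),
  run_4 = c(4, 4),
  run_5 = c(2, NA)
)

row_cf <- function(x, y, df) {
  sum(df[x, ] == df[y, ]) / ncol(df)
}

# With NAs present, the comparison contains NA and so the sum is NA
sim_raw <- row_cf(1, 2, toy)

# Ignoring NA comparisons: only run_1 and run_4 match, so 2 of 5 columns
sim_na_rm <- sum(toy[1, ] == toy[2, ], na.rm = TRUE) / ncol(toy)

# Recoding NA as 9 makes run_3 (both samples dropped) look like a shared cluster: 3 of 5 columns
filled <- toy
filled[is.na(filled)] <- 9
sim_filled <- row_cf(1, 2, filled)

print(c(raw = sim_raw, na_rm = sim_na_rm, filled = sim_filled))
```

So the score is the proportion of matching columns, and recoding NA raises it whenever both samples were dropped from the same run.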

Calculate row similarity percentage pair wise and add it as a new colum

I have seen this post, and the output is roughly what I want. My end goal is an output with the patient sample ID in the first column, the ID of the most similar sample in the second column, and the similarity score in the third.

Dummy data, if you need it:

Sample <- LETTERS[seq( from = 1, to = 20 )]
run_1 <- rep(1:4, each=5)
run_2 <- c(rep(1:2, each=4),rep(3:4,6))
run_3 <- rep(4:1, each=5)
run_4 <- c(rep(4:3, each=4),rep(1:2,6))

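library(dplyr)   # %>%, rename, rowwise, mutate, filter, group_by, slice, left_join
library(tibble)  # remove_rownames, column_to_rownames

# data.frame() keeps the run columns numeric; wrapping them in cbind() would coerce everything to character
df <- data.frame(Sample, run_1, run_2, run_3, run_4)


# move the sample IDs into the row names (the ID column is called "Sample" in the dummy data)
df1 <- df %>% remove_rownames() %>%
  column_to_rownames(var = "Sample")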


# replace NA with a value outside the cluster ID range

df1[is.na(df1)] <- 10



# define a similarity function

row_cf <- function(x, y, df){
  sum(df[x,]==df[y,])/ncol(df)
}


#calculate the similarity

Sim <- expand.grid(1:nrow(df1), 1:nrow(df1)) %>%
  rename(row_1 = Var1, row_2 = Var2) %>%
  rowwise() %>%
  mutate(similarity = row_cf(row_1, row_2, df1)) %>%
  filter(row_1 != row_2) %>%
  group_by(row_1) %>%
  slice(which.max(similarity))

 #join to known data table

 df1 %>% mutate(row_1 = 1:n()) %>%
   left_join(Sim)

This is my modified attempt, but it doesn't quite do the job: after the join I still lose the row names.

My ideal output would be:

    Row_1   Row_2  Similarity

    A        C       90%
    B        E       90%
    C        J       88%
    D        N       80%
    E        Y       70%
    F        G       60%

The reason I want to keep the IDs is that I eventually want to see which samples are most similar, as in the post above, and then sort them into 8 clusters based on that similarity to arrive at a final, stable set of 8 clusters. How could I do this segmentation? Run a hierarchical clustering?


Solution

  • I don't think it's a good idea to replace the NAs with a code: that treats all NAs as the same cluster, so two samples that were both dropped from a run would count as a match. Your choice of similarity metric is good, but because it is symmetric we can skip half the comparisons.

    Example data

    set.seed(1)
    
    Sample <- LETTERS[1:18]
    r <- sort(rep(1:6, 3))
    
    df <- replicate(20, {
        ix <- sample(1:length(r), 7)
        r[ix] <- sample(r[ix], 7, rep=TRUE)
        r
    })
    
    df[sample(1:length(df), 40)] <- NA
    df <- cbind(Sample, data.frame(df), stringsAsFactors=FALSE)
    

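    Calculating pairwise similarity (the share of runs in which two samples get the same label, i.e. the complement of a normalized Hamming distance; a run where either sample is NA counts as a non-match)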

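    # every unordered pair of rows, so each comparison is made only once
    pair <- t(combn(1:nrow(df), 2))
    similarity <- numeric(nrow(pair))
    id <- matrix("", nrow(pair), 2)
    
    # sample-by-sample matrix, named by sample ID; only one triangle gets filled
    m <- matrix(NA, nrow(df), nrow(df))
    dimnames(m) <- list(df[,1], df[,1])
    
    # proportion of columns where both rows carry the same label;
    # NA comparisons are dropped from the count but not from the denominator
    hamming <- function(a, b) {
        sum(a == b, na.rm=TRUE)/length(a)
    }
    
    for (i in 1:nrow(pair)) {
        r <- pair[i,]                                          # row indices of the current pair
        similarity[i] <- hamming(df[r[1], -1], df[r[2], -1])   # compare everything except the ID column
        id[i, ] <- df[r, 1]                                    # the two sample IDs
        m[id[i, , drop=FALSE]] <- similarity[i]                # index m by the pair of IDs
    }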
    
    out <- data.frame(id, similarity, stringsAsFactors=FALSE)
    out <- out[order(similarity, decreasing=TRUE), ]
    rownames(out) <- NULL
    
    head(out)
    #   X1 X2 similarity
    # 1  B  C       0.60
    # 2  A  B       0.50
    # 3  M  N       0.45
    # 4  P  R       0.45
    # 5  A  C       0.40
    # 6  G  H       0.40
    
    kmeans(as.dist(t(m)), 4)$cluster
    # A B C D E F G H I J K L M N O P Q R 
    # 1 1 1 2 2 2 2 4 2 2 2 2 4 4 4 3 3 3 
    
    plot(hclust(1-as.dist(t(m))))
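
To get the table asked for in the question (each sample, its most similar sample, and the score) plus a fixed number of groups, one possible sketch is below. It rebuilds a full similarity matrix from the question's dummy data so that it runs on its own; the names `labels`, `sim`, `masked`, `best`, `result` and `groups` are made up here, and `k = 4` is chosen only because the dummy data has four labels per run (the real data would presumably use `k = 8`). For brevity it compares every ordered pair rather than using `combn()`, and cutting the tree with `cutree()` is one option among several.

```r
# Dummy data from the question
Sample <- LETTERS[1:20]
run_1 <- rep(1:4, each = 5)
run_2 <- c(rep(1:2, each = 4), rep(3:4, 6))
run_3 <- rep(4:1, each = 5)
run_4 <- c(rep(4:3, each = 4), rep(1:2, 6))

labels <- cbind(run_1, run_2, run_3, run_4)
rownames(labels) <- Sample
n <- nrow(labels)

# Full similarity matrix: share of runs in which two samples carry the same label
sim <- matrix(NA_real_, n, n, dimnames = list(Sample, Sample))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- sum(labels[i, ] == labels[j, ], na.rm = TRUE) / ncol(labels)
  }
}

# Most similar other sample per row. The diagonal is masked so a sample
# cannot match itself; which.max() returns the first candidate on ties.
masked <- sim
diag(masked) <- NA
best <- apply(masked, 1, which.max)
result <- data.frame(
  Row_1 = Sample,
  Row_2 = Sample[best],
  Similarity = masked[cbind(seq_len(n), best)],
  stringsAsFactors = FALSE
)
print(head(result))

# Hierarchical clustering on 1 - similarity, then cut the tree into k groups
hc <- hclust(as.dist(1 - sim))
groups <- cutree(hc, k = 4)
print(table(groups))
```

Because the sample IDs live in the matrix dimnames rather than in data frame row names, they survive all the way into `result` and `groups`.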
    
