r cluster-analysis correlation

Find combination of n vectors across k dataframes with highest correlation


Let's assume four data frames, each with 3 vectors, e.g.

setA <- data.frame(
  a1 = c(6,5,2,4,5,3,4,4,5,3),
  a2 = c(4,3,1,4,5,1,1,6,3,2),
  a3 = c(5,4,5,6,4,6,5,5,3,3)
)

setB <- data.frame(
  b1 = c(5,3,4,3,3,6,4,4,3,5),
  b2 = c(4,3,1,3,5,2,5,2,5,6),
  b3 = c(6,5,4,3,2,6,4,3,4,6)
)

setC <- data.frame(
  c1 = c(4,4,5,5,6,4,2,2,4,6),
  c2 = c(3,3,4,4,2,1,2,3,5,4),
  c3 = c(4,5,4,3,5,5,3,5,5,6)
)

setD <- data.frame(
  d1 = c(5,5,4,4,3,5,3,5,5,4),
  d2 = c(4,4,3,3,4,3,4,3,4,5),
  d3 = c(6,5,5,3,3,4,2,5,5,4)
)

I'm trying to find n vectors in each data frame that have the highest correlation with each other. For this simple example, let's say I want to find the n = 1 vector in each of the k = 4 data frames that together show the overall strongest positive correlation, as measured by cor().

I'm not interested in the correlation of vectors within a data frame, but in the correlation between data frames, since I wish to pick 1 variable from each set.

Intuitively, I would sum all the correlation coefficients for each combination, i.e.:

sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b2, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b1, setC$c2, setD$d1)))
... # and so on...

...but this seems like brute-forcing a solution that might be solvable more elegantly, perhaps with some kind of clustering technique?

Anyhow, I was hoping to find a dynamic solution like function(n = 1, ...) (where ... stands for the data frames) which would return a list of the highest-correlating vector names.
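
To make the scoring idea concrete, here is a minimal sketch of the "sum the correlation coefficients" step for a single combination. The helper name pair_cor_sum and the toy vectors are hypothetical and only serve as illustration. It sums the upper triangle of the correlation matrix, so each pair is counted once and the diagonal of 1s is left out. For k vectors, sum(cor(...)) over the full matrix equals k plus twice this value, so both scores rank combinations identically.

```r
# Hypothetical helper: score one candidate combination by summing the
# pairwise correlations between its vectors (upper triangle only, so the
# diagonal and the mirrored duplicates are excluded).
pair_cor_sum <- function(...) {
  m <- cor(cbind(...))
  sum(m[upper.tri(m)])
}

# Toy vectors (assumed, not from the data frames above)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)  # perfectly correlated with x
z <- c(5, 4, 3, 2, 1)   # perfectly anti-correlated with x and y

pair_cor_sum(x, y)     # one pair with correlation 1
pair_cor_sum(x, y, z)  # pairs: 1 + (-1) + (-1)
```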


Solution

  • Based on your example, I would not go with a really complicated algorithm unless your actual data is huge. This is a simple approach that I think gets what you want. From your 4 data frames I create list_df, and then in the function I just generate all possible combinations of variables (one from each data frame) and calculate their correlation. At the end I select the n combinations with the highest correlation, so n here is the number of combinations returned.

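    list_df = list(setA, setB, setC, setD)

    # Return the n combinations (one column per data frame)
    # with the highest summed correlation, best first.
    CombMaxCor = function(n = 1, list_df){

      column_names = lapply(list_df, colnames)
      # every way of picking one column from each data frame
      mat_comb     = expand.grid(column_names, stringsAsFactors = FALSE)
      mat_total    = do.call(cbind, list_df)
      vec_cor      = rep(NA_real_, nrow(mat_comb))

      for(i in seq_len(nrow(mat_comb))){
        # sum of the full correlation matrix of the selected columns
        vec_cor[i] = sum(cor(mat_total[, unlist(mat_comb[i, ])]))
      }
      # row indices of the n best combinations (exactly n, even with ties)
      pos_max      = head(order(vec_cor, decreasing = TRUE), n)
      comb_max_cor = mat_comb[pos_max, ]
      return(comb_max_cor)
    }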