
Extract correlations of subsets of genes based on a key -> value data frame


I have two data frames. The first contains a 1484 x 1484 gene-gene correlation matrix, where each cell holds the correlation between genes i and j. The second is a key -> value lookup that pairs each complex with a member protein, and it looks like this:

                                  Complex Protein_ID
1                      BCL6-HDAC4 complex       Bcl6
125                    BCL6-HDAC5 complex      Hdac5
249                    BCL6-HDAC7 complex       Bcl6
373 Multisubunit ACTR coactivator complex      Ep300
497                   Condensin I complex       Smc2
621                                BLOC-3       Hps4

I want to extract from my matrix the correlations between genes that belong to the same complex and store them in a new data frame that lists, per complex, the gene-gene correlation values. Ideally it would look like this:

#this is a simulated data.frame

                    Complex                                Correlation values
                    BCL6-HDAC4 complex                     0.64
                    BCL6-HDAC4 complex                     -0.25
                    Multisubunit ACTR coactivator complex  0.31
                    Multisubunit ACTR coactivator complex  0.30

Any ideas on how I can get there?


Solution

    library(data.table) # >= v1.15.0
    
    df <-
      melt(data.table(cors),                    # matrix to long data.frame
           variable.name = "i",
           value.name = "cor"
      )[, let(i = as.integer(i), j = rowid(i))  # cols for i and j
      ][i < j                                   # keep distinct correlations
      ][, Complex := lkps$Complex[i]            # look up Complex for i
      ][Complex == lkps$Complex[j]]             # keep if Complex for j is same
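
The pipeline above matches genes to complexes by position: row i of `lkps` is taken to describe row and column i of `cors`. The excerpt in the question suggests that the real lookup table may not line up that way, since Bcl6 is listed under more than one complex. In that case a name-based lookup may be safer. What follows is a minimal base-R sketch, not the answer's method. It assumes `cors` carries gene symbols as row and column names that match `Protein_ID`, and the toy gene list and complex memberships are made up for illustration.

```r
# Toy stand-ins for the real inputs (hypothetical values, not from the question).
genes <- c("Bcl6", "Hdac4", "Hdac5", "Ep300")
set.seed(1)
cors <- cor(matrix(rnorm(50 * length(genes)), nrow = 50,
                   dimnames = list(NULL, genes)))
lkps <- data.frame(
  Complex    = c("BCL6-HDAC4 complex", "BCL6-HDAC4 complex",
                 "BCL6-HDAC5 complex", "BCL6-HDAC5 complex"),
  Protein_ID = c("Bcl6", "Hdac4", "Bcl6", "Hdac5")
)

# Keep only lookup rows whose gene is actually present in the matrix.
lkps <- lkps[lkps$Protein_ID %in% rownames(cors), ]

# Member genes per complex; a gene may appear under several complexes.
by_complex <- split(lkps$Protein_ID, lkps$Complex)

res <- do.call(rbind, lapply(names(by_complex), function(cx) {
  ids <- unique(by_complex[[cx]])
  if (length(ids) < 2) return(NULL)   # a single gene has no pair to report
  pairs <- t(combn(ids, 2))           # every unordered pair of member genes
  data.frame(Complex = cx,
             gene_i  = pairs[, 1],
             gene_j  = pairs[, 2],
             cor     = cors[pairs])   # two-column character matrix indexes by dimnames
}))

res
```

Each row of `res` is one within-complex gene pair, so a gene shared by two complexes contributes to both.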
    

    Example data (10 genes, 3 groups, only showing first 6 cols of correlation matrix):

    set.seed(1)
    n_genes <- 10
    cors <- cor(matrix(rnorm(n_genes * 50), nrow = 50, ncol = n_genes))
    lkps <- data.frame(
      Complex = sample(c("Complex A", "Complex B", "Complex C"), n_genes, replace = TRUE),
      Protein_ID = replicate(n_genes, paste0(sample(c(letters, LETTERS), 4, replace = TRUE), collapse = "")))
    
    > cors
                 [,1]         [,2]         [,3]        [,4]         [,5]        [,6]
     [1,]  1.00000000 -0.039087178  0.026287227 -0.27185574  0.013674895 -0.11933102
     [2,] -0.03908718  1.000000000  0.003552006 -0.02391178  0.039833039  0.02218480
     [3,]  0.02628723  0.003552006  1.000000000  0.21648782  0.127791868  0.12197135
     [4,] -0.27185574 -0.023911775  0.216487818  1.00000000 -0.082713154 -0.24277681
     [5,]  0.01367489  0.039833039  0.127791868 -0.08271315  1.000000000  0.09888519
     [6,] -0.11933102  0.022184800  0.121971345 -0.24277681  0.098885194  1.00000000
     [7,]  0.19468192  0.006755358 -0.074116195  0.12591453  0.184806771 -0.14283941
     [8,] -0.14785348 -0.255064246 -0.054761988 -0.03252786  0.004459162  0.03851846
     [9,]  0.02336706  0.198299294  0.069506207  0.14657036  0.183043022 -0.10887799
    [10,] -0.36678892  0.240101899  0.031648477  0.17387651  0.131315992 -0.12944992
    
    > lkps
         Complex Protein_ID
    1  Complex C       jMXs
    2  Complex C       ruTw
    3  Complex A       zoCU
    4  Complex C       PCev
    5  Complex A       aWvm
    6  Complex B       vfRO
    7  Complex A       GxvG
    8  Complex B       jSsh
    9  Complex B       lkpQ
    10 Complex B       ufxz
    

    Result:

                cor     i     j   Complex
              <num> <int> <int>    <char>
     1: -0.03908718     1     2 Complex C
     2: -0.27185574     1     4 Complex C
     3: -0.02391178     2     4 Complex C
     4:  0.12779187     3     5 Complex A
     5: -0.07411620     3     7 Complex A
     6:  0.18480677     5     7 Complex A
     7:  0.03851846     6     8 Complex B
     8: -0.10887799     6     9 Complex B
     9: -0.12944992     6    10 Complex B
    10: -0.05267148     8     9 Complex B
    11:  0.04892611     8    10 Complex B
    12:  0.18778267     9    10 Complex B
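
For readers without a recent data.table, the same positional logic can be sketched in base R. This is an illustrative equivalent rather than part of the original answer. It rebuilds the example inputs, takes one triangle of the matrix with `lower.tri()`, and keeps the pairs whose two genes share a complex. The names `idx`, `same` and `res` are introduced here only for the sketch.

```r
# Same example inputs as above (only the Complex column is needed here).
set.seed(1)
n_genes <- 10
cors <- cor(matrix(rnorm(n_genes * 50), nrow = 50, ncol = n_genes))
lkps <- data.frame(
  Complex = sample(c("Complex A", "Complex B", "Complex C"), n_genes, replace = TRUE)
)

# Row/column positions of the lower triangle (row > col), one per distinct pair.
idx <- which(lower.tri(cors), arr.ind = TRUE)

# TRUE where both genes of the pair belong to the same complex.
same <- lkps$Complex[idx[, "row"]] == lkps$Complex[idx[, "col"]]

res <- data.frame(
  Complex = lkps$Complex[idx[same, "col"]],
  i       = idx[same, "col"],                  # smaller index of the pair
  j       = idx[same, "row"],                  # larger index of the pair
  cor     = cors[idx[same, , drop = FALSE]]    # numeric matrix indexing
)

res
```

As in the data.table version, this assumes that row k of `lkps` describes row and column k of `cors`.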