Search code examples
rdataframecorrelationr-colnames

Calculate correlation between two columns based on column names


I am trying to calculate a correlation between two columns in a dataframe, based on a matching pattern in column name. In this case, I want to get the correlation between columns 'Obs1_grp1' and 'Obs1_grp2', 'Obs2_grp1' and 'Obs2_grp2', and so on.

> test
   Individual Obs1_grp1 Obs2_grp1 Obs3_grp1 Obs1_grp2 Obs2_grp2 Obs3_grp2
1      293748      1583   0.09858    0.7639     15.70    0.5877      3230
2      294934      1353   0.11090    0.7639     15.58    0.5697      3604
3      290652      1400   0.11030    0.7639     14.93    0.5779      3462
4      291320      1377   0.11480    0.7639     15.63    0.5706      3130
5      292553      1253   0.10640    0.7639     15.41    0.5750      3601
6      295073      1306   0.09881    0.7639     15.54    0.5829      3630
7      290966      1523   0.10960    0.7639     15.48    0.5724      3274
8      293501      1472   0.10090    0.7639     15.31    0.5764      3507
9      295464      1603   0.10000    0.7639     15.66    0.5816      3512
10     291367      1476   0.09944    0.7639     15.38    0.5805      3116
11     295007      1517   0.11290    0.7639     14.98    0.5773      3130
12     295006      1340   0.11290    0.7639     14.98    0.5773      3130
13     290570      1314   0.10000    0.7639     15.45    0.5806      3759
14     293940      1515   0.10350    0.7639     15.55    0.5824      3564
15     293925      1259   0.10890    0.7639     15.15    0.5757      3593
16     291353      1494   0.09944    0.7639     15.38    0.5805      3116
17     292783      1637   0.10180    0.7639     15.38    0.5754      3396
18     290833      1202   0.11030    0.7639     13.66    0.5743      3106
19     291826      1298   0.10570    0.7639     16.26    0.5786      3946
20     294410      1539   0.10070    0.7639     15.39    0.5829      3550

I was thinking to use a grep function, but not sure how?

Also, I would like to put the calculated correlation in a datatable and specify for each correlation the Obs. Something like this:

    > correl
    Obs   cor 
1   Obs1  0.853
2   Obs2  0.952

Hope anyone can help here.


Solution

  • You can create a function like this:

    cor_f <- function(x) {
      
      cor(test[,names(test)[grepl(x, names(test))]])[2]
      
    }
    
    cor_f('Obs1') #correlation between Obs1_grp1 and Obs1_grp2
    #0.3159908
    

    In case you need a loop, one way would be:

    vars <- c('Obs1', 'Obs2')    
    sapply(vars, function(i) cor_f(i))