Tags: r, correlation, pearson-correlation

How do I run multivariable correlation?


I have a large amount of data which I'd like to divide by multiple variables, as in the following plot:

[Plot: 63 panels, split by rows, cols and fram]

There are 63 plots in total, split by 3 variables (rows, cols and fram). In my real data, of course, valuex and valuey have more than 3 observations per plot. I'd like to find the Pearson correlation between valuex and valuey for every one of these plots as efficiently as possible, and I'm kinda blanking on ideas.

Here's some example data with which the plot was created:

example_df <- data.frame(rows = rep(c('r1', 'r2', 'r3'), 63),
                         cols = rep(letters[1:7], 27),
                         fram = rep(c('X', 'Y', 'Z'), each = 63),
                         valuex = rnorm(189),
                         valuey = rnorm(189))

Solution

  • You can use dplyr to group_by multiple variables then summarize to get the cor between valuex and valuey for each subgroup:

    library(dplyr)
    
    # One row per (rows, cols, fram) combination, with its Pearson correlation
    example_df %>%
      group_by(rows, cols, fram) %>%
      summarize(cor = cor(valuex, valuey))
    #> # A tibble: 63 x 4
    #> # Groups:   rows, cols [21]
    #>    rows  cols  fram     cor
    #>    <chr> <chr> <chr>  <dbl>
    #>  1 r1    a     X     -0.709
    #>  2 r1    a     Y      0.178
    #>  3 r1    a     Z     -0.597
    #>  4 r1    b     X     -0.338
    #>  5 r1    b     Y      0.981
    #>  6 r1    b     Z     -0.731
    #>  7 r1    c     X      0.945
    #>  8 r1    c     Y     -0.913
    #>  9 r1    c     Z      0.177
    #> 10 r1    d     X      0.999
    #> # ... with 53 more rows
    

    Created on 2020-07-14 by the reprex package (v0.3.0)
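
If you'd rather not depend on dplyr, the same per-group correlation can be sketched in base R with split() and sapply(). This is only a minimal sketch, not part of the original answer: it assumes the example_df layout from the question, and the set.seed() call and the names groups, cors and cor_df are additions made here for illustration.

```r
# Base-R sketch of the grouped correlation (assumes the question's example_df).
set.seed(1)  # assumption: seed added only so the sketch is reproducible
example_df <- data.frame(rows = rep(c('r1', 'r2', 'r3'), 63),
                         cols = rep(letters[1:7], 27),
                         fram = rep(c('X', 'Y', 'Z'), each = 63),
                         valuex = rnorm(189),
                         valuey = rnorm(189))

# Split the data frame into one piece per (rows, cols, fram) combination.
# drop = TRUE discards combinations that never occur in the data.
groups <- split(example_df,
                list(example_df$rows, example_df$cols, example_df$fram),
                drop = TRUE)

# Pearson correlation (the default method of cor()) within each piece.
# The result is a named numeric vector with names such as "r1.a.X".
cors <- sapply(groups, function(g) cor(g$valuex, g$valuey))

# Collect the results into a data frame, one row per group.
cor_df <- data.frame(group = names(cors), cor = unname(cors))
head(cor_df)
```

The group names are the three grouping values joined by dots, so they can be split back into separate columns if needed. For larger data the dplyr version above is probably the more readable choice.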