I have a large amount of data which I'd like to divide by multiple variables, as in the following plot:
There are a total of 63 plots here, divided by 3 variables (rows, cols and fram). In reality, of course, valuex and valuey have more than 3 observations. I would like to find the Pearson correlation for every single one of these as efficiently as possible and I'm kinda blanking on ideas.
Here's some example data with which the plot was created:
example_df <- data.frame(rows = rep(c('r1', 'r2', 'r3'), 63),
                         cols = rep(letters[1:7], 27),
                         fram = rep(c('X', 'Y', 'Z'), each = 63),
                         valuex = rnorm(189),
                         valuey = rnorm(189))
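As a quick sanity check on the example data, here is a small base R sketch that counts the rows in every rows/cols/fram combination. The set.seed call and the name group_sizes are additions for illustration only and are not part of the original example.

```r
set.seed(42)  # assumption: added only so the sketch is reproducible

example_df <- data.frame(rows = rep(c('r1', 'r2', 'r3'), 63),
                         cols = rep(letters[1:7], 27),
                         fram = rep(c('X', 'Y', 'Z'), each = 63),
                         valuex = rnorm(189),
                         valuey = rnorm(189))

# Three-way contingency table: one cell per rows/cols/fram combination.
group_sizes <- table(example_df$rows, example_df$cols, example_df$fram)

dim(group_sizes)    # number of levels of rows, cols and fram
range(group_sizes)  # smallest and largest number of rows in any cell
```

This should show 3 x 7 x 3 = 63 cells with 3 rows in each, which matches the 63 plots described above.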
You can use dplyr to group_by multiple variables, then summarize to get the cor between valuex and valuey for each subgroup:
library(dplyr)
example_df %>%
  group_by(rows, cols, fram) %>%
  summarize(cor = cor(valuex, valuey))
#> # A tibble: 63 x 4
#> # Groups:   rows, cols [21]
#>    rows  cols  fram     cor
#>    <chr> <chr> <chr>  <dbl>
#>  1 r1    a     X     -0.709
#>  2 r1    a     Y      0.178
#>  3 r1    a     Z     -0.597
#>  4 r1    b     X     -0.338
#>  5 r1    b     Y      0.981
#>  6 r1    b     Z     -0.731
#>  7 r1    c     X      0.945
#>  8 r1    c     Y     -0.913
#>  9 r1    c     Z      0.177
#> 10 r1    d     X      0.999
#> # ... with 53 more rows
Created on 2020-07-14 by the reprex package (v0.3.0)
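For comparison, here is a minimal base R sketch of the same per-group calculation, in case you'd rather not add a dplyr dependency. It assumes the example_df layout from the question. The set.seed call and the names groups and cors are illustrative additions, not part of the original code.

```r
set.seed(1)  # assumption: added only so the sketch is reproducible

example_df <- data.frame(rows = rep(c('r1', 'r2', 'r3'), 63),
                         cols = rep(letters[1:7], 27),
                         fram = rep(c('X', 'Y', 'Z'), each = 63),
                         valuex = rnorm(189),
                         valuey = rnorm(189))

# Split into one data frame per rows/cols/fram combination;
# drop = TRUE discards combinations that never occur in the data.
groups <- split(example_df, example_df[c("rows", "cols", "fram")], drop = TRUE)

# Pearson correlation (the default method of cor) within each group.
cors <- vapply(groups, function(g) cor(g$valuex, g$valuey), numeric(1))

length(cors)  # one correlation per group
head(cors)    # names have the form rows.cols.fram, e.g. "r1.a.X"
```

With only three observations per group, as in the example data, the coefficients are very noisy. They should be more meaningful on the real data, which has more observations per group.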