Permute a column within a level, perform a test on 2 columns, and save the p-values


I have a data frame

> dput(df)
structure(list(id = c(1, 2, 3, 4, 1, 2, 3, 4), level = structure(c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("g01", "g02"), class = "factor"), 
    m_col = c(1, 2, 3, 4, 11, 22, 33, 44), u_col = c(11, 12, 
    13, 14, 21, 22, 23, 24), group = c(0, 0, 1, 1, 0, 0, 1, 1
    )), row.names = c(NA, -8L), class = "data.frame")

Which looks like this

  id level m_col u_col group
1  1   g01     1    11     0
2  2   g01     2    12     0
3  3   g01     3    13     1
4  4   g01     4    14     1
5  1   g02    11    21     0
6  2   g02    22    22     0
7  3   g02    33    23     1
8  4   g02    44    24     1

I want to run a weighted binomial test within each 'level' (essentially, I need to compare m_col and u_col for each id). Using tidyverse and broom I can do the following:

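library(tidyverse)
library(broom)  # tidy() comes from broom, which library(tidyverse) does not attach

res <- df %>% 
  group_by(level) %>% 
  do(tidy(glm(cbind(.$m_col,.$u_col) ~ .$group, family="binomial"))) %>%
  filter(term == ".$group")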

Which gives me some p-values for each level:

> res
# A tibble: 2 x 6
# Groups:   level [2]
  level term    estimate std.error statistic p.value
  <fct> <chr>      <dbl>     <dbl>     <dbl>   <dbl>
1 g01   .$group    0.687     0.746     0.921  0.357 
2 g02   .$group    0.758     0.296     2.56   0.0105

I can then count how many p-values are below 0.05:

length(which(res$p.value < 0.05))

I would now like to permute the data, repeat the binomial test, count how many p-values are below 0.05, store that count, and then repeat this 999 more times.

However, the permutation needs to shuffle the 'group' column within each 'level', and I'm struggling to find a way to do this. For example, one permutation would look like this:

  id level m_col u_col group
1  1   g01     1    11     1
2  2   g01     2    12     0
3  3   g01     3    13     1
4  4   g01     4    14     0
5  1   g02    11    21     1
6  2   g02    22    22     0
7  3   g02    33    23     1
8  4   g02    44    24     0

A second would look like this:

  id level m_col u_col group
1  1   g01     1    11     0
2  2   g01     2    12     1
3  3   g01     3    13     1
4  4   g01     4    14     0
5  1   g02    11    21     0
6  2   g02    22    22     1
7  3   g02    33    23     1
8  4   g02    44    24     0

etc

Having the test rely on 2 columns limits the shuffle options and I'm stumped. I would appreciate any advice.


Solution

  • If you want a data frame, you may try this:

    library(tidyverse)
    library(broom)  # provides tidy()
    map_dfr(1:1000, ~ df %>%
                       group_by(level) %>%
                       mutate(group = group[sample(row_number())]) %>% # shuffle 'group' within each 'level'
                       do(tidy(glm(cbind(.$m_col,.$u_col) ~ .$group, family="binomial"))) %>%
                       filter(term == ".$group") %>% 
                       ungroup() %>% 
                       summarise(n_sig = sum(p.value < 0.05))) # count how many p < 0.05
    

    And if you want a vector:

    map_dbl(1:1000, ~ df %>%
                       group_by(level) %>%
                       mutate(group = group[sample(row_number())]) %>% # shuffle 'group' within each 'level'
                       do(tidy(glm(cbind(.$m_col,.$u_col) ~ .$group, family="binomial"))) %>%
                       filter(term == ".$group") %>% 
                       ungroup() %>% 
                       summarise(sum(p.value < 0.05)) %>% # count how many p < 0.05
                       pull())
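
For comparison, the same idea can be sketched in base R, which may help when checking that the shuffle really stays inside each level. This is only a sketch under some assumptions: the helper names count_sig and shuffle_group are made up for illustration, the seed is arbitrary, and the model is written as cbind(m_col, u_col) ~ group with a data argument, which should fit the same model as the .$ form in the question.

```r
# Example data from the question.
df <- data.frame(
  id    = c(1, 2, 3, 4, 1, 2, 3, 4),
  level = factor(rep(c("g01", "g02"), each = 4)),
  m_col = c(1, 2, 3, 4, 11, 22, 33, 44),
  u_col = c(11, 12, 13, 14, 21, 22, 23, 24),
  group = c(0, 0, 1, 1, 0, 0, 1, 1)
)

# Hypothetical helper: fit one binomial glm per level and count
# how many levels have p < alpha for the 'group' term.
count_sig <- function(d, alpha = 0.05) {
  p <- sapply(split(d, d$level), function(s) {
    fit <- glm(cbind(m_col, u_col) ~ group, family = binomial, data = s)
    coef(summary(fit))["group", "Pr(>|z|)"]
  })
  sum(p < alpha)
}

# Hypothetical helper: shuffle 'group' separately inside each 'level'.
# All other columns keep their original order.
shuffle_group <- function(d) {
  d$group <- ave(d$group, d$level, FUN = sample)
  d
}

set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Count for the unpermuted data.
observed <- count_sig(df)

# Counts for 999 within-level permutations.
permuted <- replicate(999, count_sig(shuffle_group(df)))

# Sanity check: one shuffle keeps the number of 1s in each level unchanged.
one_shuffle <- shuffle_group(df)
same_totals <- all(tapply(one_shuffle$group, one_shuffle$level, sum) ==
                   tapply(df$group, df$level, sum))
```

Here observed is the count for the original data and permuted holds one count per permutation, so table(permuted) gives a quick view of the permutation distribution. Note that sample() treats a single number as a special case, so levels with only one row may deserve a separate check.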