Search code examples
rrstatix

rstatix::wilcox_test doesn't take formula as variable


I am trying to do boxplots and add p-values from rstatix::wilcox_test(). But I want to do this in a loop. so the column names for the formula for wilcox_test is stored in a variable. When I use the variable name in the wilcox_test, it throws error

Error in `pull()`:
! Can't extract columns that don't exist.
✖ Column `rs_id` doesn't exist.

Can someone please help me how to use variables in the formula for wilcox_test()

here is my code

rs_ids <- c('rs2160669', 'rs964184')
phenos <- c('HBA1C', 'TG')
for (rs_id in rs_ids){
  for(phen in phenos){
    bxp <- ggboxplot(d.k, x = rs_id, y = phen)
    stat.test <- d.k %>%  wilcox_test(.data[[`phen`]] ~ .data[[`rs_id`]]) %>%  add_significance()
    stat.test <- stat.test %>% add_xy_position(x = .data[[rs_id]]) %>% mutate(myformatted.p = ifelse(p < 0.001, 'p<0.001', paste0('p=',signif(p, digits = 2))))
  
    bxp <-  bxp +   geom_jitter(position=position_jitter(0.2), alpha = 0.5)   +  theme_bw() + theme(legend.position="none", axis.text=element_text(size=12), axis.title=element_text(size=14,face="bold")) + xlab(rs_id) + ylab(phen)
    plot <- bxp + stat_pvalue_manual(stat.test, label = "{myformatted.p}", tip.length = 0.01, step.increase = 0.1)
  }
}

sample data is below

structure(list(HBA1C = c(4.5, 5.9, 8.02, 5.5, 5.4, 6, 7.1, 9.9, 
5.58, 5, 7.6, 8, 5.5, 6.4, 6.8, 9, 6.2, 4.9, 5.7, 6.2, 6.3, 5.7, 
5.9, 6.9, 7.3, 5.7, 7.2, 10.4, 5.4, 5.3, 4.8, 5.5, 7.1, 6.2, 
7, 7.4, 11.4, 5.4, 5.5, 5.3, 5.5, 5.9, 5.5, 5.79, 4.8, 6.16, 
5.63, 7.41, 7.68, 5.82), TG = c(0.83, 0.95, 2.48, 0.78, 0.48, 
1.16, 1.45, 1.32, 1.03, 3.77, 3.87, 2.34, 1.98, 2.59, 2.92, 2.71, 
6.88, 1.25, 3.68, 1.15, 2.33, 3.37, 0.61, 1.02, 1.63, 1.32, 1.21, 
1.35, 0.85, 3.97, 4.04, 3.63, 2.62, 1.46, 2.33, 2.46, 1.09, 1.46, 
2.77, 3.1, 3.13, 2.55, 1.91, 0.97, 0.87, 1.46, 1.45, 1.15, 2.61, 
2.15), rs2160669 = c(3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 2, 
1, 2, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 
2, 3, 3, 3, 2, 3, NA, 3, 3, 3, 3, 3, 3, 2, 2), rs964184 = c(1, 
2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 3, 2, 1, 1, 1, 1, 2, 2, 
1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, NA, 
2, 1, 3, 1, 1, 2, 2, 2)), row.names = c(NA, -50L), class = c("tbl_df", 
"tbl", "data.frame"))

Solution

  • A few issues here:

    1. Backticks are wrong, even in dplyr verbs that recognize .data you'd need to use .data[[phen]] and .data[[rs_id]]. (C.f., d.k %>% count(.data[[phen]])).

    2. .data can only be used in tidyverse-aware functions (e.g., dplyr), it won't work in wilcox_test.

    I suggest you make the formula as a string and then as.formula it. This will work in the %>%-pipe since the first argument to the test is data=, so the formula (as we form it) will be the second arg, formula=.

    rs_id <- "rs2160669"
    phen <- "HBA1C"
    as.formula(paste(phen, rs_id, sep = "~"))
    # HBA1C ~ rs2160669
    
    library(dplyr) # just for %>% which is magrittr, but I'm inferring your use of dplyr
    library(rstatix)
    d.k %>%
      wilcox_test(as.formula(paste(phen, rs_id, sep = "~")))
    # # A tibble: 3 x 9
    #   .y.   group1 group2    n1    n2 statistic     p p.adj p.adj.signif
    # * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl> <chr>       
    # 1 HBA1C 1      2          1    12         7 0.923  1    ns          
    # 2 HBA1C 1      3          1    36        25 0.542  1    ns          
    # 3 HBA1C 2      3         12    36       280 0.13   0.39 ns