R dplyr across: Dynamically specifying arguments to functions t.test and varTest


I am writing some dplyr across() statements and want to compute p-values using the functions t.test and varTest. The x= columns for the calculations are in df_vars, and the mu= and sigma.squared= parameter values are in df_mu_sigma.

A hard-coded version of the result I need is in df_sumry. If the variable names were always the same whenever the code is run, something like this would suffice. That's not the case, however.

The beginnings of a non-hard-coded version of what I need are in df_sumry2. It doesn't yield a correct result yet, because the values of mu= and sigma.squared= are not specified dynamically. Only the first two p-values (those for mpg) are correct in df_sumry2. The p-values for every other column are wrong because the code always uses the values for the mpg variable.

How can I consistently get the right values inserted for mu and sigma.squared?

library(dplyr)
library(magrittr)
library(EnvStats)

df_vars <- mtcars %>%
  select(mpg, cyl, disp, hp)

set.seed(9302)

df_mu_sigma <- mtcars %>%
  select(mpg, cyl, disp, hp) %>%
  slice_sample(n = 12) %>%
  summarize(
    across(
      everything(),
      list(mean = mean,
           std = sd
      ))
  )

df_sumry <- df_vars %>%
  summarize(
    mpg_mean = mean(mpg),
    mpg_mean_prob = t.test(mpg, mu = df_mu_sigma$mpg_mean)$p.value,
    mpg_std = sd(mpg),
    mpg_std_prob = varTest(mpg, sigma.squared = df_mu_sigma$mpg_std^2)$p.value,
 
    cyl_mean = mean(cyl),
    cyl_mean_prob = t.test(cyl, mu = df_mu_sigma$cyl_mean)$p.value,
    cyl_std = sd(cyl),
    cyl_std_prob = varTest(cyl, sigma.squared = df_mu_sigma$cyl_std^2)$p.value,

    disp_mean = mean(disp),
    disp_mean_prob = t.test(disp, mu = df_mu_sigma$disp_mean)$p.value,
    disp_std = sd(disp),
    disp_std_prob = varTest(disp, sigma.squared = df_mu_sigma$disp_std^2)$p.value,
 
    hp_mean = mean(hp),
    hp_mean_prob = t.test(hp, mu = df_mu_sigma$hp_mean)$p.value,
    hp_std = sd(hp),
    hp_std_prob = varTest(hp, sigma.squared = df_mu_sigma$hp_std^2)$p.value
   )

vars_num <- names(df_vars)

df_sumry2 <- df_vars %>%
  summarize(
    across(
      all_of(vars_num),
      list(mean = mean,
           mean_prob = function(x) t.test(x, mu = df_mu_sigma$mpg_mean)$p.value,
           std = sd,
           std_prob = function(x) varTest(x, sigma.squared = df_mu_sigma$mpg_std^2)$p.value)
    )
  )


Solution

  • This is not much better than your solution, but I would use cur_column() instead of ensym() to avoid having to handle quosures.

    Also, putting the query in a separate function makes things a bit tidier.

    Finally, I would use purrr-style lambda formulas (~ ...) instead of function(x) anonymous functions for clarity.

    get_mu = function(suffix){
      df_mu_sigma[[paste0(cur_column(), suffix)]] #you could use glue() as well here
    }
    
    df_vars %>%
      summarize(
        across(
          all_of(vars_num),
          list(
            mean = mean,
            mean_prob = ~t.test(.x, mu = get_mu("_mean"))$p.value,
            std = sd,
            std_prob = ~varTest(.x, sigma.squared = get_mu("_std")^2)$p.value
          )
        )
      ) %>% t() #just to format the output
    
    
    #                        [,1]
    # mpg_mean        20.09062500
    # mpg_mean_prob    0.01808550
    # mpg_std          6.02694805
    # mpg_std_prob     0.96094601
    # cyl_mean         6.18750000
    # cyl_mean_prob    0.10909740
    # cyl_std          1.78592165
    # cyl_std_prob     0.77092484
    # disp_mean      230.72187500
    # disp_mean_prob   0.17613878
    # disp_std       123.93869383
    # disp_std_prob    0.96381507
    # hp_mean        146.68750000
    # hp_mean_prob     0.03914858
    # hp_std          68.56286849
    # hp_std_prob      0.03459963
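
A quick way to convince yourself that the cur_column() lookup picks up the right reference value for each column is to compare one dynamic result against its hard-coded equivalent. The sketch below is only an illustration under a few assumptions: df_ref and get_ref are hypothetical names, the reference values come from the first 12 rows of mtcars rather than a random sample (so the check is deterministic), and only t.test is used, so EnvStats is not needed.

```r
library(dplyr)

df_vars <- mtcars %>%
  select(mpg, cyl, disp, hp)

# Hypothetical reference table: means and standard deviations taken from
# the first 12 rows (an assumption made here to avoid random sampling).
df_ref <- df_vars %>%
  slice_head(n = 12) %>%
  summarize(across(everything(), list(mean = mean, std = sd)))

# Look up the reference value for whichever column across() is processing.
get_ref <- function(suffix) {
  df_ref[[paste0(cur_column(), suffix)]]
}

# Dynamic version: one t-test p-value per column.
dynamic <- df_vars %>%
  summarize(
    across(
      everything(),
      list(mean_prob = ~ t.test(.x, mu = get_ref("_mean"))$p.value)
    )
  )

# Hard-coded version for a column other than the first one.
hard_coded <- t.test(df_vars$hp, mu = df_ref$hp_mean)$p.value

# TRUE if the dynamic lookup used the hp reference mean, not the mpg one.
isTRUE(all.equal(dynamic$hp_mean_prob, hard_coded))
```

The same comparison can be repeated for any other column, or for the varTest() results, by swapping the column name and suffix.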