
R: Continue t.test in a map function even when there are not enough observations


In my example data I have 3 dataframes. Every df has 2 variables (varA and varB) per threshold. There are 3 thresholds (1, 2, 3):

library(tidyverse)  # tibble(), map(), %>% and add_column() are used below

df1 <- tibble(
  var1A = rnorm(10) + 1,
  var1B = rnorm(10) + 1,
  var2A = rnorm(10) + 2,
  var2B = rnorm(10) + 2,
  var3A = rnorm(10) + 3,
  var3B = rnorm(10) + 3)


df2 <- tibble(
  var1A = rnorm(10) + 1,
  var1B = rnorm(10) + 1,
  var2A = rnorm(10) + 2,
  var2B = rnorm(10) + 2,
  var3A = rnorm(10) + 3,
  var3B = rnorm(10) + 3)


df3 <- tibble(
  var1A = rnorm(10) + 1,
  var1B = NA,  # the whole column is missing
  var2A = rnorm(10) + 2,
  var2B = rnorm(10) + 2,
  var3A = rnorm(10) + 3,
  var3B = rnorm(10) + 3)

Now I want to perform a t.test for each pair of variables, t.test(varA, varB), at each threshold (1, 2, 3). Since I have more than one df, I map over all dfs and, within each df, apply the t.test to every threshold:

thresholds = c(1, 2, 3)

list_dfs = c('df1','df2','df3')

map(list_dfs,
function(df_name){
  x <- get(df_name)
  lapply(thresholds, function(i){
    t.test(x %>%
             pull(paste0("var",i,"A")), 
           x %>% 
             pull(paste0("var",i,"B")))
  }) %>% 
    map_df(broom::tidy) %>% 
    add_column(.before = 'estimate',
               df = df_name, 
               threshold = thresholds)
}) %>% 
do.call(rbind, .)

This code collects all results in one df. But the problem is that var1B in df3 is empty: the whole column is NA, so t.test fails and the whole map call aborts.
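
The failure is easy to reproduce in isolation. This is a minimal sketch that assumes two made-up vectors, x and y, which are not taken from the data above. t.test() raises an error when one sample has no non-missing values, and because nothing catches that error, an enclosing map() call would stop as well. Here the error is captured with tryCatch() only so that its message can be inspected.

```r
# Hypothetical vectors, for illustration only
x <- c(1.2, 0.8, 1.5, 0.9, 1.1)
y <- rep(NA_real_, 5)   # every value missing, like var1B in df3

# tryCatch() turns the error into a plain character string
msg <- tryCatch(
  {
    t.test(x, y)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(msg)  # the message says there are not enough 'y' observations
```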

How can I run the map function even though there are not enough observations for var1B? Here is my desired output:

# A tibble: 9 x 12
  df    threshold estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method
  <chr>     <dbl>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr> 
1 df1           1  -0.582      0.992     1.57    -1.43     0.170      16.6   -1.44      0.276 Welch~
2 df1           2   0.271      2.75      2.48     0.654    0.522      17.8   -0.601     1.14  Welch~
3 df1           3  -0.250      3.12      3.37    -0.544    0.593      17.7   -1.22      0.716 Welch~
4 df2           1  -0.169      0.747     0.916   -0.407    0.690      15.3   -1.05      0.714 Welch~
5 df2           2   0.0259     1.94      1.91     0.0702   0.945      17.9   -0.748     0.800 Welch~
6 df2           3   0.496      3.28      2.79     1.11     0.281      17.5   -0.444     1.44  Welch~
7 df3           1   NA         NA        NA       NA       NA         NA      NA        NA    NA   
8 df3           2  -0.274      1.99      2.26    -0.650    0.525      15.8   -1.17      0.622 Welch~
9 df3           3   0.407      3.34      2.93     0.920    0.371      16.6   -0.529     1.34  Welch~

Because var1B (threshold 1) in df3 is all NA, row 7 of the output is also NA.


Solution

  • What I would do is combine the data.frames in a different format, so that the "A" parts are in one data.frame and the "B" parts are in the other:

    dfs <- cbind(df1=df1, df2=df2, df3=df3)
    dfA <- dfs[,grep("A$", colnames(dfs))]
    dfB <- dfs[,grep("B$", colnames(dfs))]
    

    Then everything is a lot easier:

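    doTtest <- function(x, y) {
      # Welch's t.test needs at least two non-missing values in each sample
      if (sum(!is.na(x)) > 1 && sum(!is.na(y)) > 1)
        broom::tidy(t.test(x, y))
      else
        rep(NA, 10)  # broom::tidy() returns 10 columns for a two-sample t.test
    }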
    res <- as.data.frame(t(mapply(doTtest, dfA, dfB)))
    

    Alternatively, you could use the convenient matrixTests library:

    library(matrixTests)
    > col_t_welch(dfA, dfB)
              obs.x obs.y obs.tot    mean.x    mean.y   mean.diff     var.x     var.y    stderr       df  statistic     pvalue   conf.low conf.high alternative mean.null conf.level
    df1.var1A    10    10      20 1.5436119 0.7488449  0.79476695 0.2993602 0.5481971 0.2911284 16.57158  2.7299537 0.01449227  0.1793279 1.4102060   two.sided         0       0.95
    df1.var2A    10    10      20 2.2205661 2.2320260 -0.01145988 0.4832561 0.5249799 0.3175273 17.96923 -0.0360910 0.97160771 -0.6786419 0.6557222   two.sided         0       0.95
    df1.var3A    10    10      20 3.0457651 2.7835908  0.26217424 1.2998193 1.9933106 0.5738580 17.23565  0.4568626 0.65347516 -0.9473005 1.4716490   two.sided         0       0.95
    df2.var1A    10    10      20 1.7233471 1.2761199  0.44722715 0.9328694 1.3631385 0.4791668 17.38932  0.9333434 0.36342238 -0.5620050 1.4564593   two.sided         0       0.95
    df2.var2A    10    10      20 1.9278754 2.6368740 -0.70899858 1.0966493 0.6907785 0.4227798 17.11741 -1.6769925 0.11170922 -1.6005202 0.1825230   two.sided         0       0.95
    df2.var3A    10    10      20 3.1245106 2.9569952  0.16751542 1.0357228 0.8209887 0.4308958 17.76242  0.3887609 0.70207375 -0.7386317 1.0736625   two.sided         0       0.95
    df3.var1A    10     0      10 0.6804275       NaN         NaN 0.6015624 0.0000000       NaN      NaN         NA         NA         NA        NA   two.sided         0       0.95
    df3.var2A    10    10      20 2.0143381 1.9223843  0.09195379 0.7837613 0.7611496 0.3930535 17.99614  0.2339472 0.81766669 -0.7338338 0.9177413   two.sided         0       0.95
    df3.var3A    10    10      20 3.0156624 3.2768350 -0.26117263 1.5437758 1.2608029 0.5295827 17.81860 -0.4931668 0.62791751 -1.3745971 0.8522518   two.sided         0       0.95
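
If the long output format from the question (one row per df and threshold) is preferred, another option is to keep the map-based structure and wrap the test in purrr::possibly(), which returns a fallback value instead of raising an error. The following is only a sketch under some assumptions: the data frames are kept in a named list rather than fetched with get(), and make_df() and safe_t_test() are hypothetical helper names introduced for illustration. The seed is fixed only to make the example reproducible.

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(1)  # assumption: fixed seed, only for reproducibility

# Hypothetical helper: simulate one data frame shaped like those in the question
make_df <- function() {
  tibble(
    var1A = rnorm(10) + 1, var1B = rnorm(10) + 1,
    var2A = rnorm(10) + 2, var2B = rnorm(10) + 2,
    var3A = rnorm(10) + 3, var3B = rnorm(10) + 3
  )
}

dfs <- list(df1 = make_df(), df2 = make_df(), df3 = make_df())
dfs$df3$var1B <- NA_real_  # the all-NA column from the question

thresholds <- 1:3

# possibly() returns `otherwise` whenever the wrapped function errors
safe_t_test <- possibly(
  function(x, y) broom::tidy(t.test(x, y)),
  otherwise = tibble(estimate = NA_real_)
)

res <- imap(dfs, function(df, df_name) {
  map(thresholds, function(i) {
    out <- safe_t_test(df[[paste0("var", i, "A")]],
                       df[[paste0("var", i, "B")]])
    # Label each one-row result with its data frame and threshold
    add_column(out, df = df_name, threshold = i, .before = 1)
  }) %>%
    bind_rows()
}) %>%
  bind_rows()  # columns missing from the fallback row are filled with NA

print(res)
```

With this layout the row for df3 at threshold 1 is kept and filled with NA, while the other eight rows hold the usual Welch test results.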