r dplyr statistics anova t-test

Running multiple t-tests on variables with groupings in R (not using rstatix)


I have a data frame containing different items (and their costs) along with their groupings. I would like to run a t-test for each item, comparing its groups, to see whether the group means differ. Does anybody know how to do this in R without using the rstatix package? If possible, I'd like it done in base R using lapply or a loop. tidyr and dplyr are fine.

A sample of the data frame is as follows:

df = structure(list(Item = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("Book A", 
"Book B", "Book C", "Book D"), class = "factor"), Cost = c(7L, 
9L, 6L, 7L, 4L, 6L, 5L, 3L, 5L, 4L, 7L, 2L, 2L, 4L, 2L, 9L, 4L
), Grouping = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 
1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L), .Label = c("A", "B"), class = "factor")), class = "data.frame", row.names = c(NA, 
-17L))
Item    Cost  Grouping
Book A  7     A
Book A  9     B
Book A  6     A
Book A  7     B
Book B  4     A
Book B  6     B
Book B  5     A
Book B  3     A
Book C  5     B
Book C  4     A
Book C  7     A
Book C  2     B
Book C  2     B
Book D  4     A
Book D  2     B
Book D  9     B
Book D  4     A

The output should be a simple table (or something similar) like this:

Item    P-Value (H0: mean of group A = mean of group B)
Book A  xxx
Book B  xxx
Book C  xxx
Book D  xxx

With the rstatix package, the code would be (credit: Quinten):

library(dplyr)
library(rstatix)
df %>% 
  group_by(Item) %>%
  t_test(Cost ~ Grouping)

I would like to get the same output without the rstatix package, as I often run into issues with the broom package (a dependency of rstatix). Base R would be fine, as I sometimes code on my phone.

Thank you!


Solution

  • The error comes from the number of observations per 'Grouping' level: Book B has only one observation in group B, which is too few for t.test(). In base R, we can skip such cases and return NA instead:

    lapply(split(df, df$Item), function(x) {
      # Skip items where either group has fewer than 2 observations
      if (any(table(x$Grouping) < 2)) NA else t.test(Cost ~ Grouping, data = x)
    })
    

    -output

    $`Book A`
    
        Welch Two Sample t-test
    
    data:  Cost by Grouping
    t = -1.3416, df = 1.4706, p-value = 0.3499
    alternative hypothesis: true difference in means between group A and group B is not equal to 0
    95 percent confidence interval:
     -8.418523  5.418523
    sample estimates:
    mean in group A mean in group B 
                6.5             8.0 
    
    
    $`Book B`
    [1] NA
    
    $`Book C`
    
        Welch Two Sample t-test
    
    data:  Cost by Grouping
    t = 1.3868, df = 1.8989, p-value = 0.3059
    alternative hypothesis: true difference in means between group A and group B is not equal to 0
    95 percent confidence interval:
     -5.666332 10.666332
    sample estimates:
    mean in group A mean in group B 
                5.5             3.0 
    
    
    $`Book D`
    
        Welch Two Sample t-test
    
    data:  Cost by Grouping
    t = -0.42857, df = 1, p-value = 0.7422
    alternative hypothesis: true difference in means between group A and group B is not equal to 0
    95 percent confidence interval:
     -45.97172  42.97172
    sample estimates:
    mean in group A mean in group B 
                4.0             5.5 
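
    To see why Book B comes back as NA, it can help to look at the group sizes directly. The following is only a small sketch: it rebuilds the question's data in a compact form (same values, with Item and Grouping as factors) and uses the illustrative names counts and too_small, which are not part of the code above.

    ```r
    # Same values as the question's df, rebuilt compactly so this runs on its own.
    df <- data.frame(
      Item = factor(rep(c("Book A", "Book B", "Book C", "Book D"),
                        times = c(4, 4, 5, 4))),
      Cost = c(7, 9, 6, 7, 4, 6, 5, 3, 5, 4, 7, 2, 2, 4, 2, 9, 4),
      Grouping = factor(c("A", "B", "A", "B", "A", "B", "A", "A", "B",
                          "A", "A", "B", "B", "A", "B", "B", "A"))
    )

    # Observations per Item (rows) and Grouping level (columns).
    counts <- table(df$Item, df$Grouping)
    print(counts)

    # Items where at least one group has fewer than 2 observations.
    too_small <- rownames(counts)[apply(counts < 2, 1, any)]
    print(too_small)  # "Book B"
    ```

    Because Grouping is a factor, a level with no rows at all would still show up as a count of 0, so the `< 2` check covers that case as well.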
    

    Or, to get only the p-values:

    stack(lapply(split(df, df$Item), function(x) {
      if (any(table(x$Grouping) < 2)) NA else t.test(Cost ~ Grouping, data = x)$p.value
    }))[2:1]
      ind    values
    1 Book A 0.3498856
    2 Book B        NA
    3 Book C 0.3058987
    4 Book D 0.7422379
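
    As a variation, the count check can be swapped for tryCatch(), and the result can be given column names closer to the table asked for in the question. This is a sketch under assumptions rather than a drop-in replacement: p_value_or_na, result and P_Value are made-up names, and the data is rebuilt compactly so the block runs on its own.

    ```r
    # Same values as the question's df, rebuilt compactly so this runs on its own.
    df <- data.frame(
      Item = factor(rep(c("Book A", "Book B", "Book C", "Book D"),
                        times = c(4, 4, 5, 4))),
      Cost = c(7, 9, 6, 7, 4, 6, 5, 3, 5, 4, 7, 2, 2, 4, 2, 9, 4),
      Grouping = factor(c("A", "B", "A", "B", "A", "B", "A", "A", "B",
                          "A", "A", "B", "B", "A", "B", "B", "A"))
    )

    # Hypothetical helper: the p-value of the Welch t-test, or NA if t.test() fails.
    p_value_or_na <- function(x) {
      tryCatch(
        t.test(Cost ~ Grouping, data = x)$p.value,
        error = function(e) NA_real_
      )
    }

    # vapply() guarantees one numeric value per item.
    p <- vapply(split(df, df$Item), p_value_or_na, numeric(1))

    result <- data.frame(Item = names(p), P_Value = unname(p))
    print(result)
    ```

    Note the trade-off: tryCatch() turns any error into NA, not only the too-few-observations one, so the explicit table() check is the safer choice when you want other problems to surface.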
    

    The same approach works with dplyr:

    library(dplyr)
    df %>% 
      add_count(Item, Grouping) %>%
      group_by(Item) %>%
      summarise(out = list(if (any(n < 2)) NA else t.test(Cost ~ Grouping)))
    

    -output

    # A tibble: 4 × 2
      Item   out      
      <fct>  <list>   
    1 Book A <htest>  
    2 Book B <lgl [1]>
    3 Book C <htest>  
    4 Book D <htest>  
    

    If only the p-value is needed:

    df %>% 
      add_count(Item, Grouping) %>%
      group_by(Item) %>%
      summarise(out = if (any(n < 2)) NA_real_ else t.test(Cost ~ Grouping)$p.value)
    # A tibble: 4 × 2
      Item      out
      <fct>   <dbl>
    1 Book A  0.350
    2 Book B NA    
    3 Book C  0.306
    4 Book D  0.742
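
    Since the question also mentions plain looping, here is how the same check might look as a for loop in base R. Again this is only a sketch: items and p_values are illustrative names, and the data is rebuilt compactly (same values as the question's df) so the block runs on its own.

    ```r
    # Same values as the question's df, rebuilt compactly so this runs on its own.
    df <- data.frame(
      Item = factor(rep(c("Book A", "Book B", "Book C", "Book D"),
                        times = c(4, 4, 5, 4))),
      Cost = c(7, 9, 6, 7, 4, 6, 5, 3, 5, 4, 7, 2, 2, 4, 2, 9, 4),
      Grouping = factor(c("A", "B", "A", "B", "A", "B", "A", "A", "B",
                          "A", "A", "B", "B", "A", "B", "B", "A"))
    )

    items <- levels(df$Item)

    # Start with NA for every item; only items that pass the check get overwritten.
    p_values <- rep(NA_real_, length(items))
    names(p_values) <- items

    for (item in items) {
      x <- df[df$Item == item, ]
      # Run the test only when both groups have at least 2 observations.
      if (all(table(x$Grouping) >= 2)) {
        p_values[item] <- t.test(Cost ~ Grouping, data = x)$p.value
      }
    }

    print(round(p_values, 3))
    ```

    Pre-filling the vector with NA_real_ keeps the result numeric and makes the skipped items explicit without needing an else branch.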