I have a data frame containing different items, their costs, and their groupings. I would like to run a t-test for each item, comparing its groups, to see whether the group means differ. Does anybody know how to do this in R without using the rstatix package? If possible, it should be done in base R using lapply or a loop. tidyr and dplyr are fine.
A sample of the data frame is as follows:
df = structure(list(Item = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("Book A",
"Book B", "Book C", "Book D"), class = "factor"), Cost = c(7L,
9L, 6L, 7L, 4L, 6L, 5L, 3L, 5L, 4L, 7L, 2L, 2L, 4L, 2L, 9L, 4L
), Grouping = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L,
1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L), .Label = c("A", "B"), class = "factor")), class = "data.frame", row.names = c(NA,
-17L))
Item | Cost | Grouping |
---|---|---|
Book A | 7 | A |
Book A | 9 | B |
Book A | 6 | A |
Book A | 7 | B |
Book B | 4 | A |
Book B | 6 | B |
Book B | 5 | A |
Book B | 3 | A |
Book C | 5 | B |
Book C | 4 | A |
Book C | 7 | A |
Book C | 2 | B |
Book C | 2 | B |
Book D | 4 | A |
Book D | 2 | B |
Book D | 9 | B |
Book D | 4 | A |
The output should be a simple table (or something similar) as follows:
Item | P-Value (H0: Mean of group A = Mean of group B) |
---|---|
Book A | xxx |
Book B | xxx |
Book C | xxx |
Book D | xxx |
Using the rstatix package, the code would be (credit: Quinten):
library(dplyr)
library(rstatix)
df %>%
  group_by(Item) %>%
  t_test(Cost ~ Grouping)
I would like to achieve the same output without using the rstatix package, as I often encounter issues with the broom package (a dependency of rstatix). Base R would be fine, as I sometimes code on my phone.
Thank you!
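The error relates to the number of observations per level of 'Grouping'. For one item (Book B), group B has only 1 observation, and the default Welch t-test needs at least 2 observations in each group. With base R, we can handle this as
lapply(split(df, df$Item), function(x) {
  # return NA when either group has fewer than 2 observations
  if (any(table(x$Grouping) < 2)) NA else t.test(Cost ~ Grouping, data = x)
})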
-output
$`Book A`
Welch Two Sample t-test
data: Cost by Grouping
t = -1.3416, df = 1.4706, p-value = 0.3499
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
-8.418523 5.418523
sample estimates:
mean in group A mean in group B
6.5 8.0
$`Book B`
[1] NA
$`Book C`
Welch Two Sample t-test
data: Cost by Grouping
t = 1.3868, df = 1.8989, p-value = 0.3059
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
-5.666332 10.666332
sample estimates:
mean in group A mean in group B
5.5 3.0
$`Book D`
Welch Two Sample t-test
data: Cost by Grouping
t = -0.42857, df = 1, p-value = 0.7422
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
-45.97172 42.97172
sample estimates:
mean in group A mean in group B
4.0 5.5
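Since the question mentions trouble with broom, here is a minimal base R sketch of how the `htest` objects above could be flattened into one table. The helper name `htest_row` and the column names are hypothetical, chosen only for illustration; the fields `statistic`, `parameter` and `p.value` are the standard components of the object returned by `t.test`.

```r
# Rebuild the sample data from the question (same 17 rows).
df <- data.frame(
  Item = factor(rep(c("Book A", "Book B", "Book C", "Book D"), times = c(4, 4, 5, 4))),
  Cost = c(7, 9, 6, 7, 4, 6, 5, 3, 5, 4, 7, 2, 2, 4, 2, 9, 4),
  Grouping = factor(c("A", "B", "A", "B", "A", "B", "A", "A", "B",
                      "A", "A", "B", "B", "A", "B", "B", "A"))
)

# One t-test per item, NA where a group has fewer than 2 observations.
tests <- lapply(split(df, df$Item), function(x) {
  if (any(table(x$Grouping) < 2)) NA else t.test(Cost ~ Grouping, data = x)
})

# Hypothetical helper: turn one result (htest or NA) into a one-row data frame.
htest_row <- function(item, res) {
  if (!inherits(res, "htest")) {
    return(data.frame(Item = item, statistic = NA_real_,
                      df = NA_real_, p_value = NA_real_))
  }
  data.frame(
    Item = item,
    statistic = unname(res$statistic),  # t statistic
    df = unname(res$parameter),         # Welch degrees of freedom
    p_value = res$p.value
  )
}

summary_table <- do.call(rbind, Map(htest_row, names(tests), tests))
rownames(summary_table) <- NULL
print(summary_table)
```

This keeps the test skipped for Book B as a row of `NA` values, so every item still appears in the table.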
Or, to get only the p-value:
stack(lapply(split(df, df$Item), function(x) {
  if (any(table(x$Grouping) < 2)) NA else t.test(Cost ~ Grouping, data = x)$p.value
}))[2:1]
ind values
1 Book A 0.3498856
2 Book B NA
3 Book C 0.3058987
4 Book D 0.7422379
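The question also asks about looping, so the same check can be sketched with a plain `for` loop. This is only one possible layout; the names `p_values` and `result` and the column name `p_value` are illustrative assumptions, not part of the original answer.

```r
# Rebuild the sample data from the question (same 17 rows).
df <- data.frame(
  Item = factor(rep(c("Book A", "Book B", "Book C", "Book D"), times = c(4, 4, 5, 4))),
  Cost = c(7, 9, 6, 7, 4, 6, 5, 3, 5, 4, 7, 2, 2, 4, 2, 9, 4),
  Grouping = factor(c("A", "B", "A", "B", "A", "B", "A", "A", "B",
                      "A", "A", "B", "B", "A", "B", "B", "A"))
)

items <- levels(df$Item)
# Start with NA for every item; only items with enough data get overwritten.
p_values <- setNames(rep(NA_real_, length(items)), items)

for (item in items) {
  sub <- df[df$Item == item, ]
  # Run the test only when both groups have at least 2 observations.
  if (all(table(sub$Grouping) >= 2)) {
    p_values[item] <- t.test(Cost ~ Grouping, data = sub)$p.value
  }
}

result <- data.frame(Item = items, p_value = unname(p_values))
print(result)
```

The result has one row per item and matches the shape of the table requested in the question.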
The same approach works with dplyr:
library(dplyr)
df %>%
  add_count(Item, Grouping) %>%
  group_by(Item) %>%
  summarise(out = list(if(any(n < 2)) NA else t.test(Cost ~ Grouping)))
-output
# A tibble: 4 × 2
Item out
<fct> <list>
1 Book A <htest>
2 Book B <lgl [1]>
3 Book C <htest>
4 Book D <htest>
If only the p-value is needed:
df %>%
  add_count(Item, Grouping) %>%
  group_by(Item) %>%
  summarise(out = if(any(n < 2)) NA_real_ else t.test(Cost ~ Grouping)$p.value)
# A tibble: 4 × 2
Item out
<fct> <dbl>
1 Book A 0.350
2 Book B NA
3 Book C 0.306
4 Book D 0.742
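As an alternative to counting observations up front, the call could be wrapped in `tryCatch` so that any item for which `t.test` fails yields `NA`. This is a sketch in base R, not the approach used above, and the names `p_values` and `result` are illustrative. Note that it hides every error, not only the one caused by too few observations, so the explicit count check may be preferable when the reason for the failure matters.

```r
# Rebuild the sample data from the question (same 17 rows).
df <- data.frame(
  Item = factor(rep(c("Book A", "Book B", "Book C", "Book D"), times = c(4, 4, 5, 4))),
  Cost = c(7, 9, 6, 7, 4, 6, 5, 3, 5, 4, 7, 2, 2, 4, 2, 9, 4),
  Grouping = factor(c("A", "B", "A", "B", "A", "B", "A", "A", "B",
                      "A", "A", "B", "B", "A", "B", "B", "A"))
)

# Try the test for each item; fall back to NA if t.test signals an error.
p_values <- vapply(split(df, df$Item), function(x) {
  tryCatch(
    t.test(Cost ~ Grouping, data = x)$p.value,
    error = function(e) NA_real_
  )
}, numeric(1))

result <- data.frame(Item = names(p_values), p_value = unname(p_values))
print(result)
```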