I have a question about using t.test() to check whether one population mean is bigger than another.
Imagine I have 2 variables in a dataframe d:
Weight: Numerical variable (weight of people).
Anykids: Categorical variable that can be yes or no.
The dataframe would be like:
Anykids Weight
yes 70
yes 84
no 66
... ..
I want to check whether the mean weight of people with Anykids = yes is bigger than that of people with Anykids = no. So I would have:
H0: m(weight_yes) = m(weight_no)
H1: m(weight_yes) > m(weight_no)
The function would be:
t.test(Weight ~ Anykids, data = d, alternative = 'greater')
How does the function know that alternative = 'greater' refers to the group with Anykids = yes and not the group with Anykids = no?
If I wanted to check the hypothesis:
H0: m(weight_no) = m(weight_yes)
H1: m(weight_no) > m(weight_yes)
The function call would have the same arguments. How do I know whether greater refers to Anykids = yes or Anykids = no?
Like many things with factors, R decides based on the order of the factor's levels: the first level is treated as x and the second as y, and alternative = "greater" tests whether the mean of x is greater than the mean of y. In your case, you can run levels(d$Anykids) to find out in advance which group will be used as x and which as y in t.test(), and change the order with relevel() if needed.
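As a minimal sketch of that check, assuming a small made-up data frame in place of your d (the weights below are invented for illustration) and assuming Anykids is stored as text: converting it with factor() sorts the levels alphabetically, so "no" would come first unless you reorder them.

```r
# Hypothetical data standing in for the asker's data frame `d`
d <- data.frame(
  Anykids = c("yes", "yes", "no", "no", "yes", "no"),
  Weight  = c(70, 84, 66, 72, 79, 68)
)

# factor() sorts levels alphabetically, so "no" comes before "yes"
d$Anykids <- factor(d$Anykids)
levels(d$Anykids)
#> [1] "no"  "yes"

# Make "yes" the first level so that alternative = "greater" tests
# H1: mean(Weight | yes) > mean(Weight | no)
d$Anykids <- relevel(d$Anykids, ref = "yes")
levels(d$Anykids)
#> [1] "yes" "no"

res <- t.test(Weight ~ Anykids, data = d, alternative = "greater")

# The estimates are listed in level order, so "yes" appears first
names(res$estimate)
```

Without the relevel() step, the same call would test whether the "no" group has the greater mean.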
But the t.test() results will also show you which group came first, because the sample estimates are listed in level order. Here, in the iris dataset with setosa removed, versicolor comes before virginica, so the test asks whether versicolor has a greater mean Sepal.Width than virginica.
levels(iris$Species)
#> [1] "setosa" "versicolor" "virginica"
test_data <- iris[iris$Species != 'setosa', ]
t.test(data = test_data, Sepal.Width ~ Species, alternative = "greater")
#>
#> Welch Two Sample t-test
#>
#> data: Sepal.Width by Species
#> t = -3.2058, df = 97.927, p-value = 0.9991
#> alternative hypothesis: true difference in means is greater than 0
#> 95 percent confidence interval:
#>  -0.3096707        Inf
#> sample estimates:
#> mean in group versicolor  mean in group virginica
#>                    2.770                    2.974
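To test the opposite direction (virginica greater than versicolor), there are two routes that should be equivalent: keep the level order and ask for the other tail, or reorder the levels and keep alternative = "greater". The sketch below compares the two on the same iris subset; the object names less, flipped and greater are only illustrative.

```r
# Same subset as above; droplevels() removes the unused "setosa" level
test_data <- droplevels(iris[iris$Species != "setosa", ])

# Option 1: keep versicolor first and ask for the lower tail,
# i.e. H1: mean(versicolor) < mean(virginica)
less <- t.test(Sepal.Width ~ Species, data = test_data, alternative = "less")

# Option 2: put virginica first and keep alternative = "greater",
# i.e. H1: mean(virginica) > mean(versicolor)
flipped <- test_data
flipped$Species <- relevel(flipped$Species, ref = "virginica")
greater <- t.test(Sepal.Width ~ Species, data = flipped, alternative = "greater")

# Both calls state the same hypothesis, so the p-values should agree
all.equal(less$p.value, greater$p.value)
```

Which route to use is mostly a matter of taste: alternative = "less" leaves the data untouched, while relevel() makes the hypothesis read in the same direction as the output.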