I am attempting to loop through columns and subset the rows that have the same value in each.
See below.
White <- rep(0:1, 50)
Latino <- rep(0:1, 50)
Black <- rep(0:1, 50)
Asian <- rep(0:1, 50)
DV <- seq(1: length(rep(0:1, 50)))
x <- data.frame(cbind(White, Latino, Black, Asian, DV))
race <- c("White", "Latino", "Black", "Asian")
for(j in race){
for (i in race){
df_1 <- subset(x, i == 1)
df_2 <- subset(x, j == 1)
print(paste(i, j, sep = " "))
print(t.test(df_1$DV, df_2$DV) )
}
}
Unfortunately, R does not like the i or j to stand alone. If anyone knows a better way of looping through columns to subset on the same value, it would be much appreciated. Thank you.
Note that i and j in your code are strings, but what you actually want is the column each one names, like:
for(j in race){
for (i in race){
df_1 <- subset(x, x[,i] == 1)
df_2 <- subset(x, x[,j] == 1)
print(paste(i, j, sep = " "))
print(t.test(df_1$DV, df_2$DV) )
}
}
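To see why the original loop fails, here is a minimal sketch on a small made-up data frame (toy and col_name are hypothetical names, not from the question). Inside subset(), the bare loop variable is evaluated as the string it holds, so the condition compares the text "White" with 1 instead of comparing the column's values.

```r
# Toy data, invented for illustration only
toy <- data.frame(White = c(1, 0, 1, 0), Black = c(0, 1, 0, 1), DV = 1:4)
col_name <- "White"

# The string itself is compared with 1: "White" == 1 is FALSE,
# and that single FALSE is recycled over all rows, so nothing is kept
wrong <- subset(toy, col_name == 1)
nrow(wrong)   # 0

# Looking the column up by name compares its values with 1
right <- toy[toy[[col_name]] == 1, ]
right$DV      # 1 3
```

Both toy[[col_name]] and toy[, col_name] look the column up by name; the double-bracket form always returns a plain vector.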
As for a better way of looping: the dummy variables White, Latino, Black and Asian appear to be mutually exclusive, so perhaps we could rearrange the data into
    race  DV
-------------
1  Black   1
2  White   2
3 Latino   3
4  Black   4
5  Asian   5
and invoke t.test with a formula, like:
# generate synthetic data
# (the 50 random race codes are recycled by data.frame() to fill all 100 rows)
rnd.race <- sample(1:4, 50, replace = TRUE)
x <- data.frame(
  White = as.integer(rnd.race == 1),
  Latino = as.integer(rnd.race == 2),
  Black = as.integer(rnd.race == 3),
  Asian = as.integer(rnd.race == 4),
  DV = seq_len(100)
)
race <- c("White", "Latino", "Black", "Asian")
# rearrange data: collapse the four dummy columns into a single race column
x.cleaned = data.frame(
  race = race[apply(x[, 1:4], 1, which.max)],
  DV = x$DV
)
# compare two groups only; the subset argument picks the rows to keep
t.test(DV ~ race, data = x.cleaned, subset = race %in% c("White", "Black"))
#
# Welch Two Sample t-test
#
# data: DV by race
# t = -0.91517, df = 42.923, p-value = 0.3652
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -25.241536 9.483961
# sample estimates:
# mean in group Black mean in group White
# 47.66667 55.54545
#
A small benefit of using t.test with a formula is readability. For example, instead of mean of x and mean of y, the t.test report says mean in group Black and mean in group White, and the formula itself states which variable is being compared across which groups.
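The difference in labelling can be checked directly on the returned objects. The following is a small sketch on invented data (demo, by_formula and by_vectors are hypothetical names, and the seed is arbitrary); it runs the same Welch test both ways and inspects the labels.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
demo <- data.frame(
  race = rep(c("Black", "White"), each = 10),
  DV   = rnorm(20)
)

# Formula interface: groups are named after the levels of race
by_formula <- t.test(DV ~ race, data = demo)

# Default interface: the two samples are passed as plain vectors
by_vectors <- t.test(demo$DV[demo$race == "Black"],
                     demo$DV[demo$race == "White"])

names(by_formula$estimate)  # "mean in group Black" "mean in group White"
names(by_vectors$estimate)  # "mean of x" "mean of y"
by_formula$data.name        # "DV by race"
```

The test statistic is the same in both calls; only the labels differ.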
To run the t-test iteratively across all pairs, we could define
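# run a Welch t-test on the two races named in race.pair
run.test = function(race.pair) {
  # wrap in list() so combn() keeps each test result as one element
  list(t.test(DV ~ race, data = x.cleaned, subset = race %in% race.pair))
}
# apply run.test to every pair of races
combn(race, 2, FUN = run.test)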
# [[1]]
#
# Welch Two Sample t-test
#
# data: DV by race
# t = -0.30892, df = 41.997, p-value = 0.7589
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -21.22870 15.59233
# sample estimates:
# mean in group Latino mean in group White
# 52.72727 55.54545
#
#
# [[2]]
#
# Welch Two Sample t-test
#
# data: DV by race
# t = -0.91517, df = 42.923, p-value = 0.3652
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -25.241536 9.483961
# sample estimates:
# mean in group Black mean in group White
# 47.66667 55.54545
#
# ...
where combn(x, m, FUN = NULL, simplify = TRUE, ...) is a builtin that generates all combinations of the elements of x taken m at a time. For a more general case using outer, see @askrun's answer.
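If only the p-values are of interest, the pairwise results can be collected into a small table. This is a sketch under stated assumptions: x.demo is invented data with 25 rows per group, the seed is arbitrary, and pairs, p.values and result are hypothetical names. It uses simplify = FALSE so that combn returns a plain list of pairs.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable
race <- c("White", "Latino", "Black", "Asian")
x.demo <- data.frame(
  race = rep(race, each = 25),  # 25 invented rows per group
  DV   = rnorm(100)
)

# list of the 6 possible pairs, each a character vector of length 2
pairs <- combn(race, 2, simplify = FALSE)

# one Welch t-test per pair, keeping only the p-value
p.values <- sapply(pairs, function(race.pair) {
  t.test(DV ~ race, data = x.demo, subset = race %in% race.pair)$p.value
})

result <- data.frame(
  group1  = sapply(pairs, `[`, 1),
  group2  = sapply(pairs, `[`, 2),
  p.value = p.values
)
result
```

Keeping the pairs in a list makes it easy to line up each p-value with the two group names it belongs to.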
Finally, IMHO, ANOVA is a more widely recognized choice than the t-test when comparing means across three or more groups (which may also suggest why it feels "inconvenient" to run t-tests iteratively over pairs of groups).
With x.cleaned, we can easily use ANOVA in R, like:
aov.out = aov(DV ~ race, data=x.cleaned)
summary(aov.out)
Note that a one-way ANOVA only tests whether some of the group means differ. Afterwards we may also run post hoc tests (like TukeyHSD(aov.out)) to find out which specific pairs of groups have different means. A few tests of assumptions are also de rigueur in a formal report. Here are some lecture notes related to this. And this is a related question on Cross Validated (where further questions on which test to choose could be answered).
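A minimal sketch of that workflow, on invented data, might look as follows. The names x.demo and tukey are hypothetical, the seed is arbitrary, and the choice of shapiro.test and bartlett.test as the assumption checks is my own assumption rather than something the linked material prescribes.

```r
set.seed(7)  # arbitrary seed, only so the sketch is repeatable
race <- c("White", "Latino", "Black", "Asian")
x.demo <- data.frame(
  race = factor(rep(race, each = 25)),  # TukeyHSD needs a factor
  DV   = rnorm(100)
)

# one-way ANOVA: do any of the group means differ?
aov.out <- aov(DV ~ race, data = x.demo)
summary(aov.out)

# post hoc test: which specific pairs differ?
tukey <- TukeyHSD(aov.out)
tukey$race  # one row per pair, columns diff, lwr, upr, p adj

# assumption checks (one possible choice)
shapiro.test(residuals(aov.out))         # normality of residuals
bartlett.test(DV ~ race, data = x.demo)  # equal variances across groups
```

TukeyHSD reports p-values already adjusted for the six comparisons, which the pairwise t-tests above do not do on their own.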