
How to loop through columns and subset the dataset according to the same value


I am attempting to loop through columns and subset data with the same value.

See below.

White <- rep(0:1, 50)
Latino <- rep(0:1, 50)
Black <- rep(0:1, 50)
Asian <- rep(0:1, 50)
DV <- seq(1: length(rep(0:1, 50)))
x <- data.frame(cbind(White, Latino, Black, Asian, DV))


race <- c("White", "Latino", "Black", "Asian")

for(j in race){
  for (i in race){

    df_1 <- subset(x, i == 1)
    df_2 <- subset(x, j == 1)
    print(paste(i, j, sep = " "))
    print(t.test(df_1$DV, df_2$DV) )


  }
}

Unfortunately, R does not like i or j standing alone. If anyone knows a better way of looping through columns to subset on the same value, it would be much appreciated. Thank you.


Solution

  • Note that i and j in your code are strings, whereas what you actually wanted was the column that each one names, like

    for(j in race){
      for (i in race){
    
        df_1 <- subset(x, x[,i] == 1)
        df_2 <- subset(x, x[,j] == 1)
        print(paste(i, j, sep = " "))
        print(t.test(df_1$DV, df_2$DV) )
    
    
      }
    }
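
    To see why the original loop returns nothing useful, here is a minimal sketch on a small made-up data frame (toy is a hypothetical name used only for illustration, not the asker's data). Inside subset(), i == 1 compares the string itself with 1, while toy[[i]] == 1 compares the values of the column named by i.

    ```r
    # Toy data, hypothetical and only for illustration
    toy <- data.frame(White = c(1, 0, 1, 0), DV = 1:4)
    i <- "White"

    # Compares the string "White" with 1: a single FALSE, so no rows are kept
    nrow(subset(toy, i == 1))         # 0

    # Looks up the column named by i, then compares its values with 1
    nrow(subset(toy, toy[[i]] == 1))  # 2

    # Equivalent plain indexing, without subset()
    toy[toy[[i]] == 1, ]
    ```

    Both x[, i] and x[[i]] work for a data frame; x[[i]] always returns a plain vector, which is why it is used in this sketch.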
    

    Regarding a better way of looping: the dummy variables White, Latino, Black and Asian appear to be mutually exclusive, so we could rearrange the data into

          race  DV
       ------------
    1    Black   1
    2    White   2
    3   Latino   3
    4    Black   4
    5    Asian   5
    

    and invoke t.test with formula, like

    # generate synthetic data
    # note: the 50 draws are recycled by data.frame() to match the 100 rows of DV
    rnd.race <- sample(1:4, 50, replace = TRUE)
    x <- data.frame(
      White = as.integer(rnd.race == 1),
      Latino = as.integer(rnd.race == 2),
      Black = as.integer(rnd.race == 3),
      Asian = as.integer(rnd.race == 4),
      DV = seq_len(100)
    )
    
    race <- c("White", "Latino", "Black", "Asian")
    
    # rearrange data, gather columns of dummy variables
    x.cleaned = data.frame(
      race = race[apply(x[,1:4], 1, which.max)],
      DV = x$DV
    )
    
    # compare two groups only; subset keeps the rows for which the condition holds
    t.test(DV ~ race, data = x.cleaned, subset = race %in% c("White", "Black"))
    
    # 
    #     Welch Two Sample t-test
    # 
    # data:  DV by race
    # t = -0.91517, df = 42.923, p-value = 0.3652
    # alternative hypothesis: true difference in means is not equal to 0
    # 95 percent confidence interval:
    #  -25.241536   9.483961
    # sample estimates:
    # mean in group Black mean in group White 
    #            47.66667            55.54545 
    # 
    

    A small benefit of using t.test with a formula is readability. For example, instead of mean of x and mean of y, the report says mean in group Black and mean in group White, and the formula itself states which variable is being compared across which grouping variable.

    To run the t-test iteratively across all pairs, we could do

    run.test = function(race.pair) {
        list(t.test(DV ~ race, data=x.cleaned, race %in% race.pair) )
    }
    
    combn(race, 2, FUN = run.test)
    
    # [[1]]
    # 
    #     Welch Two Sample t-test
    # 
    # data:  DV by race
    # t = -0.30892, df = 41.997, p-value = 0.7589
    # alternative hypothesis: true difference in means is not equal to 0
    # 95 percent confidence interval:
    #  -21.22870  15.59233
    # sample estimates:
    # mean in group Latino  mean in group White 
    #             52.72727             55.54545 
    # 
    # 
    # [[2]]
    # 
    #     Welch Two Sample t-test
    # 
    # data:  DV by race
    # t = -0.91517, df = 42.923, p-value = 0.3652
    # alternative hypothesis: true difference in means is not equal to 0
    # 95 percent confidence interval:
    #  -25.241536   9.483961
    # sample estimates:
    # mean in group Black mean in group White 
    #            47.66667            55.54545 
    # 
    # ...
    

    where combn(x, m, FUN = NULL, simplify = TRUE, ...) is a built-in function that generates all combinations of the elements of x taken m at a time. For a more general case using outer, see @askrun's answer.
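
    As a rough sketch of what combn produces and how the pairwise results could be collected into one table, the following uses freshly generated synthetic data in place of x.cleaned. The seed, the object names pairs and p.values, and the choice to keep only the p-value are assumptions made for illustration.

    ```r
    set.seed(1)  # arbitrary seed, only so the sketch is reproducible

    race <- c("White", "Latino", "Black", "Asian")

    # Synthetic long-format data standing in for x.cleaned
    x.cleaned <- data.frame(
      race = sample(race, 100, replace = TRUE),
      DV   = seq_len(100)
    )

    # All unordered pairs: a 2 x 6 character matrix, one pair per column
    pairs <- combn(race, 2)

    # One Welch t-test per pair, keeping only the p-value
    p.values <- apply(pairs, 2, function(race.pair) {
      t.test(DV ~ race, data = x.cleaned, subset = race %in% race.pair)$p.value
    })

    # One row per comparison
    data.frame(group1 = pairs[1, ], group2 = pairs[2, ], p.value = p.values)
    ```

    Calling combn without FUN and looping over the columns keeps the pair labels next to the results, which the list output above does not do directly.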


    Finally, IMHO, ANOVA is perhaps more widely recognized than the t-test for comparing means across three or more groups (which may also suggest why it is "inconvenient" to run t-tests iteratively over pairs of groups).

    With x.cleaned, we can easily use ANOVA in R, like:

    aov.out = aov(DV ~ race, data=x.cleaned)
    summary(aov.out)
    

    Note that after a one-way ANOVA (which tests whether any of the group means differ), we may also run post hoc tests (like TukeyHSD(aov.out)) to find out which specific pairs of groups have different means. A few tests of assumptions are also de rigueur in a formal report. Here are some lecture notes related to this, and this is a related question on Cross Validated (where further questions on which test to choose could be answered).
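
    The ANOVA workflow described above could be sketched as follows, again on synthetic data standing in for x.cleaned. The seed and the two particular assumption checks (shapiro.test on the residuals and bartlett.test on the groups) are choices made for this illustration; other checks may suit a given dataset better.

    ```r
    set.seed(1)  # arbitrary seed, only so the sketch is reproducible

    race <- c("White", "Latino", "Black", "Asian")

    # Synthetic data; race is stored as a factor because TukeyHSD needs one
    x.cleaned <- data.frame(
      race = factor(sample(race, 100, replace = TRUE)),
      DV   = seq_len(100)
    )

    # One-way ANOVA: tests whether all group means are equal
    aov.out <- aov(DV ~ race, data = x.cleaned)
    summary(aov.out)

    # Post hoc: Tukey's honest significant differences for every pair of groups
    tukey <- TukeyHSD(aov.out)
    tukey$race  # matrix with columns diff, lwr, upr, p adj

    # Two common assumption checks
    shapiro.test(residuals(aov.out))            # normality of residuals
    bartlett.test(DV ~ race, data = x.cleaned)  # homogeneity of variances
    ```

    With four groups, tukey$race has one row for each of the six pairwise comparisons, and its p-values are already adjusted for multiple comparisons.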