Create a data frame with unique combo generated from nested for loops


I have a data frame as such:

  Feature ID Sub Value
1       A T1  B1  5.87
2       B T1  B2  3.99
3       C T1  B3 12.57
4       A T1  B2  9.22
5       B T1  B3  7.89
6       C T1  B1  4.76
7       A T2  B1  4.56
8       B T2  B2  9.26
9       C T2  B2  7.44

What I want to do is run a one-factor ANOVA on this dataset, with "Sub" as the factor. I want to loop through each feature and each ID. In other words, for each feature within an ID, I am calculating the variance between "Sub" levels.

I have generated the below code, but it doesn't seem to be working.

datalist <- list()

for (i in unique(data1$Feature)) {
  for (j in unique(data1$ID)) {
    A1 <- summary(aov(data1$value ~ as.factor(data1$Sub), data=data1))
    datalist[[j]] <- A1
  }
}

big_data <- do.call(rbind, datalist)

I end up with big_data, which is a matrix of 36 lists, and I am unable to access the ANOVA output. The result doesn't necessarily have to be a data frame; even a write.csv() call inside the loop that generates the separate outputs would do. Ultimately, I only need the "between" factor row of the ANOVA output to generate a plot, so if that can also be incorporated into the code, it would be a great help.


Solution

  • Several issues with current setup:

    • You do not actually use i and j in your aov call, so every iteration of the nested for loops returns the exact same result, computed on the entire data frame. Quick fix: subset the data frame by the i-th and j-th values.

      aov(value ~ Sub, data = subset(data1, Feature == i & ID == j))
      
    • You store list elements under j alone rather than under both i and j, so each pass of the outer loop overwrites the entries from the previous pass and only the last pass survives. Quick fix: name each element by both the i-th and j-th values.

      datalist[[paste0(i, "_", j)]] <- A1
      
    • You are attempting to rbind list objects, not matrices or data frames, because summary.aov returns a list of results. For your use case, calling str shows that the result is a list of 1:

      str(summary(aov(data1$value ~ as.factor(data1$Sub), data = data1)))
      List of 1
      $ :Classes ‘anova’ and 'data.frame': 2 obs. of  5 variables:
        ..$ Df     : num [1:2] ...
        ..$ Sum Sq : num [1:2] ...
        ..$ Mean Sq: num [1:2] ...
        ..$ F value: num [1:2] ...
        ..$ Pr(>F) : num [1:2] ...
      - attr(*, "class")= chr [1:2] "summary.aov" "listof"
      

      Quick fix: index the first item.

      summary(aov(...))[[1]]
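
    Putting the three quick fixes together, the corrected loop might look like the sketch below. The data here are made up for illustration (names and values are assumptions, not your data), with two observations per Sub level in every Feature/ID cell so that each ANOVA has residual degrees of freedom. The response column is assumed to be named value, as in your code.

```r
# Illustrative data only (hypothetical values, not the asker's data):
# two observations per Sub level in every Feature/ID cell.
set.seed(1)
data1 <- expand.grid(
  Feature = c("A", "B", "C"),
  ID      = c("T1", "T2"),
  Sub     = c("B1", "B2", "B3"),
  rep     = 1:2,
  stringsAsFactors = FALSE
)
data1$value <- round(runif(nrow(data1), 3, 13), 2)

datalist <- list()

for (i in unique(data1$Feature)) {
  for (j in unique(data1$ID)) {
    # Fix 1: fit the model on the current Feature/ID subset only
    sub_df <- subset(data1, Feature == i & ID == j)
    # Fix 3: aov() fits from the formula; [[1]] pulls out the ANOVA table
    A1 <- summary(aov(value ~ Sub, data = sub_df))[[1]]
    # Fix 2: key the result by both loop values so nothing is overwritten
    datalist[[paste0(i, "_", j)]] <- A1
  }
}

# Each element is now a data frame, so rbind stacks them row-wise
big_data <- do.call(rbind, datalist)
```

    Note that a Feature/ID cell with only one Sub level, or with a single observation per level, gives aov nothing to estimate a residual from, so real data may need such cells filtered out first.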
      

    However, consider an apply-family solution with by (an object-oriented wrapper for tapply) to avoid the bookkeeping of initializing a list and assigning to it iteratively inside nested for loops. Specifically, by splits a data frame by one or more grouping variables and runs a function on each subset, returning a list-like object with one element per combination of group values. Also, consider a user-defined function to encapsulate all of the processing done on each subset.

    # USER-DEFINED FUNCTION
    run_anova <- function(sub_df) {
      # RAW RESULTS
      anova_raw <- summary(aov(value ~ Sub, data = sub_df))[[1]]
    
      # CLEAN UP DATA WITH IDENTIFIERS
      anova_df <- data.frame(
        within(anova_raw, {Feature <- sub_df$Feature[1]; ID <- sub_df$ID[1]}),
        row.names = NULL,
        check.names = FALSE
      )
      
      return(anova_df)
    }
    
    datalist <- by(data1, data1[c("Feature", "ID")], run_anova)
      
    big_data <- do.call(rbind, unname(datalist))
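
    Since you only need the "between" row for plotting, one option is to keep the term labels, which row.names = NULL otherwise discards, and filter on them afterwards. The sketch below is a variant of the function above, not a replacement for it. The name run_anova_terms, the Term column, the output file name, and the example data are all assumptions made for illustration.

```r
# Illustrative data only (hypothetical values, not the asker's data):
# two observations per Sub level in every Feature/ID cell.
set.seed(1)
data1 <- expand.grid(
  Feature = c("A", "B", "C"),
  ID      = c("T1", "T2"),
  Sub     = c("B1", "B2", "B3"),
  rep     = 1:2,
  stringsAsFactors = FALSE
)
data1$value <- round(runif(nrow(data1), 3, 13), 2)

# Hypothetical variant of run_anova that stores the row labels
# ("Sub", "Residuals") in a Term column before dropping row names
run_anova_terms <- function(sub_df) {
  anova_raw <- summary(aov(value ~ Sub, data = sub_df))[[1]]
  data.frame(
    Feature = sub_df$Feature[1],
    ID      = sub_df$ID[1],
    Term    = trimws(rownames(anova_raw)),  # labels are space-padded
    anova_raw,
    row.names = NULL,
    check.names = FALSE
  )
}

results <- do.call(
  rbind,
  unname(by(data1, data1[c("Feature", "ID")], run_anova_terms))
)

# Keep only the between-groups ("Sub") row of each ANOVA table
between <- subset(results, Term == "Sub")

# Write the filtered table out; the file name is an arbitrary choice
write.csv(between, file.path(tempdir(), "anova_between.csv"), row.names = FALSE)
```

    The between data frame then has one row per Feature/ID combination, which is a convenient shape to hand to a plotting function.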