
Subset a data frame and then apply a calculation to each subset in a for loop in R


I've been struggling to write a for loop for this specific purpose. I have quite a large data set, so I'd like to learn how to do this in a loop, because the alternative is doing it manually.

To spare you the details about my specific data, I'll refer to this reproducible data frame as an example:

mydf <- data.frame(Factor = rep(c("level 1", "level 2", "level 3"), 4),
                   numeric = c(1:12))
mydf
    Factor numeric
1  level 1       1
2  level 2       2
3  level 3       3
4  level 1       4
5  level 2       5
6  level 3       6
7  level 1       7
8  level 2       8
9  level 3       9
10 level 1      10
11 level 2      11
12 level 3      12

I have a categorical (as a factor) and a numerical column, and I'd like to be able to perform calculations on the numerical data for individual levels within the factor, or at the very least to ask R what the mean or standard deviation is for each level. For example, what is the standard deviation of the numeric data belonging to "level 1"?

I haven't used for loops to subset/group data before so I don't even know where to start.

Thank you in advance to anyone who can help

I tried searching for an answer on Stack Overflow but couldn't find what I needed. I pulled together some ideas from a few different questions where people needed to filter a data frame in a for loop. I tried this, but it just printed out the entire data frame 12 times and then gave the standard deviation for the whole numeric column 12 times, rather than once for each level:

    sd<-c()
    for (levels in mydf$Factor){
    a<-dplyr::filter(mydf, Factor == Factor)
    sd<-append(sd, sd(a$numeric))
    }
sd
     [1] 3.605551 3.605551 3.605551 3.605551 3.605551 3.605551 3.605551
     [8] 3.605551 3.605551 3.605551 3.605551 3.605551

I also tried this, based on another answer I found here on Stack Overflow:

output<-rep(NA,3)
names(output)<-levels(mydf$Factor)
for (i in 1:length(output)){
  sd[i]<- mean(subset(mydf, Factor == levels(mydf$Factor)[i])$numeric)
}
sd

but this gave me:

sd
 [1] 5.500000 6.500000 7.500000 3.605551 3.605551 3.605551 3.605551
 [8] 3.605551 3.605551 3.605551 3.605551 3.605551

Solution

  • If I understand correctly what you want, this will do the job:

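    library(dplyr)

    # scale() is part of base R, so the scales package is not needed
    mydf1 <- mydf %>%
      group_by(Factor) %>%
      mutate(sd = sd(numeric),
             mean = mean(numeric),
             # scale() returns a one-column matrix; as.numeric() flattens it
             standardised = as.numeric(scale(numeric)))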
    

    You really do not need loops for this. Learn dplyr (or data.table) instead; it will make your life much easier. Also, look into the base R scale() function: its center and scale arguments let you choose whether to center around 0 and whether to divide by the standard deviation.
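If you still want to see how the loop itself could be written, here is a minimal sketch in base R. It is only one way to do it, and the names `lvls`, `level_sd` and `by_tapply` are mine, not from the question. It addresses the problems in the two attempts: the first loop ran over all 12 rows and compared `Factor == Factor` (always TRUE), and the second wrote `mean()` results into `sd[i]`, which still held the 12 values from the first attempt, instead of into `output`.

```r
# Rebuild the example data from the question
mydf <- data.frame(Factor = rep(c("level 1", "level 2", "level 3"), 4),
                   numeric = 1:12)

# Loop over the distinct levels, not over all 12 rows.
# unique() works whether Factor is stored as a factor or as character.
lvls <- unique(as.character(mydf$Factor))

# Pre-allocate a named result vector; calling it 'level_sd' avoids
# reusing the name of the sd() function
level_sd <- setNames(numeric(length(lvls)), lvls)

for (lvl in lvls) {
  # Compare the column with the loop variable to pick this level's rows
  rows <- mydf$Factor == lvl
  level_sd[lvl] <- sd(mydf$numeric[rows])
}
level_sd

# The same per-level result without a loop, using base R
by_tapply <- tapply(mydf$numeric, mydf$Factor, sd)
by_tapply
```

With the example data, every level has a standard deviation of sqrt(15), about 3.87, because each level's values are spaced 3 apart. The loop and `tapply()` should agree, so the second call is a handy check on the first.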