Tags: r, for-loop, dplyr, sample, resampling

Resample and looping over dplyr functions in R


I have the following data set (dat) with 8 unique treatment groups. I want to sample 3 points from each group and store their mean and variance. I want to repeat this 1000 times (sampling with replacement), using a loop to store all the values in output. When I run the loop below, I keep getting the error unexpected '=' in:"output[i] <- summarise(group_by(new_df[i], fertilizer,crop, level),mean[i]="

Any suggestions on how to fix it, or make it more

fertilizer <- c("N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P","N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P")

crop <- c("alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group")

level <- c("low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","low")

growth <- c(0,0,1,2,90,5,2,5,8,55,1,90,2,4,66,80,1,90,2,33,56,70,99,100,66,80,1,90,2,33,0,0,1,2,90,5,2,2,5,8,55,1,90,2,4,66,0,0)

dat <- data.frame(fertilizer, crop, level, growth)

library(dplyr)

for(i in 1:1000){
  new_df[i] <- dat %>% 
                  group_by(fertilizer, crop, level) %>% 
                  sample_n(3)
  output[i] <- summarise(
                  group_by(new_df[i], fertilizer, crop, level),
                  mean[i] = mean(growth), 
                  var[i] = sd(growth) * sd(growth))
}

Solution

  • I don't think you need a loop. You can do this faster by sampling 3 * 1000 rows per group at once (with replacement), assigning a sample_id to each block of 3 rows, adding it to the grouping variables, and finally summarising to get the desired values. This way each function is called only once.

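    library(dplyr)

    dat %>% 
      group_by(fertilizer, crop, level) %>% 
      # draw all 1000 resamples of 3 rows per group in one call
      sample_n(3 * 1000, replace = TRUE) %>% 
      # label each consecutive block of 3 rows within a group as one resample
      mutate(sample_id = rep(1:1000, each = 3)) %>% 
      # dplyr >= 1.0.0 spells this `.add`; older versions used `add = TRUE`
      group_by(sample_id, .add = TRUE) %>% 
      summarise(
        mean = mean(growth, na.rm = TRUE),
        var = sd(growth)^2
      ) %>% 
      ungroup()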
    
    # A tibble: 8,000 x 6
       fertilizer crop  level sample_id  mean      var
       <chr>      <chr> <chr>     <int> <dbl>    <dbl>
     1 N          alone high          1 30.7  2640.   
     2 N          alone high          2  1       0    
     3 N          alone high          3 60.3  2640.   
     4 N          alone high          4  1.33    0.333
     5 N          alone high          5  1.33    0.333
     6 N          alone high          6 60.3  2640.   
     7 N          alone high          7  1.33    0.333
     8 N          alone high          8 30.3  2670.   
     9 N          alone high          9  1.33    0.333
    10 N          alone high         10 60.7  2581.   
    # ... with 7,990 more rows
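
If you do want to keep the loop, here is a minimal sketch of how it could be written. The error in the question comes from `mean[i] = ...` inside `summarise()`: the left-hand side of `=` in a function call has to be a plain name, not an indexing expression, so R fails to parse it. In addition, `new_df` and `output` are never created before the loop. The sketch below stores each iteration's summary in a list and binds the pieces together at the end. The small data frame and the names `n_iter` and `result` are made up for illustration; swap in the question's `dat` and add `level` to the grouping to reproduce the original setup.

```r
library(dplyr)

# Hypothetical stand-in data (not the question's `dat`): 4 groups of 3 rows
dat <- data.frame(
  fertilizer = rep(c("N", "P"), each = 6),
  crop       = rep(c("alone", "group"), times = 6),
  growth     = c(0, 2, 90, 5, 8, 55, 2, 4, 66, 80, 1, 90)
)

n_iter <- 1000                      # number of resamples (assumed name)
output <- vector("list", n_iter)    # preallocate one list slot per iteration

for (i in seq_len(n_iter)) {
  output[[i]] <- dat %>%
    group_by(fertilizer, crop) %>%
    sample_n(3, replace = TRUE) %>%   # 3 rows per group, with replacement
    summarise(
      mean = mean(growth),            # plain column names, no `[i]`
      var  = var(growth)
    ) %>%
    ungroup() %>%
    mutate(sample_id = i)             # record which iteration this came from
}

# One row per group per iteration
result <- bind_rows(output)
```

This gives the same kind of table as the vectorised version above, just more slowly, because the whole pipeline runs once per iteration instead of once in total.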