Tags: r, for-loop, apply, cumsum, split-apply-combine

Avoiding the use of for loop for cumsum


First, some sample data:

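    doy <- rep(1:365, times = 2)                # day of year, repeated for two years
    year <- rep(2000:2001, each = 365)
    set.seed(1)
    value <- runif(365 * 2, min = 0, max = 10)  # one random value per day
    doy.range <- c(40, 50, 60, 80)              # start days for the cumulative sum
    thres <- 200                                # threshold the cumulative sum must reach

    df <- data.frame(cbind(doy, year, value))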

What I want to do is the following:

For each year (e.g. df$year == 2000) and each start day in doy.range (e.g. 40), start adding up df$value from that day onward and find the first df$doy at which the cumulative sum of df$value is >= thres.
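
To make the target concrete, here is a toy version with made-up numbers (not the sample data above; `x` and `toy.thres` are hypothetical names used only for illustration). It only sketches the core idea: take a running total and pick the first position where it reaches the threshold.

```r
# Toy illustration with hypothetical numbers, not the sample data
x <- c(3, 5, 4, 6)                      # values from the start day onward
toy.thres <- 10                         # hypothetical threshold
cs <- cumsum(x)                         # running total: 3 8 12 18
first.idx <- which(cs >= toy.thres)[1]  # first position where the total reaches the threshold
first.idx                               # 3
```

If the running total never reaches the threshold, `which(...)[1]` gives NA.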

Here's my long for loop to achieve this:

    # Create a matrix to store results: one row per (year, start day) pair
    mat <- matrix(NA, nrow = length(doy.range) * length(unique(year)), ncol = 3)
    mat[, 1] <- rep(unique(year), each = length(doy.range))
    mat[, 2] <- rep(doy.range, times = length(unique(year)))

    for (i in unique(df$year)) {
      dat <- df[df$year == i, ]
      for (j in doy.range) {
        dat1 <- dat[dat$doy >= j, ]
        dat1$cum.sum <- cumsum(dat1$value)
        # first doy at which the cumulative sum of value becomes >= thres
        day.thres <- dat1[dat1$cum.sum >= thres, "doy"][1]
        mat[mat[, 2] == j & mat[, 1] == i, 3] <- day.thres
      }
    }

This loop stores, in the third column of the matrix, the doy at which the cumulative sum of df$value first reaches thres.

However, I really want to avoid the loops. Is there any way I can do it using less code?


Solution

  • If I understand correctly, you can use dplyr. Assuming a threshold of 200 (the value of thres) and a start day of 40:

    library(dplyr)
    df %>% group_by(year) %>% 
      filter(doy >= 40) %>% 
      mutate(CumSum = cumsum(value)) %>% 
      filter(CumSum >= 200) %>% 
      top_n(n = -1, wt = CumSum)
    

    which yields

    # A tibble: 2 x 4
    # Groups:   year [2]
        doy  year    value   CumSum
      <dbl> <dbl>    <dbl>    <dbl>
    1    78  2000 3.899895 201.4864
    2    75  2001 9.205178 204.3171
    

    The verbs used are self-explanatory I guess. If not, let me know.

    For the different start days in doy.range, wrap the pipeline in a function and use lapply:

    f <- function(doy.range) {
      df %>% group_by(year) %>% 
        filter(doy >= doy.range) %>% 
        mutate(CumSum = cumsum(value)) %>% 
        filter(CumSum >= 200) %>% 
        top_n(n = -1, wt = CumSum)
    }
    
    lapply(doy.range, f)
    
    [[1]]
    # A tibble: 2 x 4
    # Groups:   year [2]
        doy  year    value   CumSum
      <dbl> <dbl>    <dbl>    <dbl>
    1    78  2000 3.899895 201.4864
    2    75  2001 9.205178 204.3171
    
    [[2]]
    # A tibble: 2 x 4
    # Groups:   year [2]
        doy  year    value   CumSum
      <dbl> <dbl>    <dbl>    <dbl>
    1    89  2000 2.454885 200.2998
    2    91  2001 6.578281 200.6544
    
    [[3]]
    # A tibble: 2 x 4
    # Groups:   year [2]
        doy  year    value   CumSum
      <dbl> <dbl>    <dbl>    <dbl>
    1    98  2000 4.100841 200.5048
    2   102  2001 7.158333 200.3770
    
    [[4]]
    # A tibble: 2 x 4
    # Groups:   year [2]
        doy  year    value   CumSum
      <dbl> <dbl>    <dbl>    <dbl>
    1   120  2000 6.401010 204.9951
    2   120  2001 5.884192 200.8252
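
The lapply call returns a list, whereas the question asks for one table with a row per year and start day. A minimal base R sketch of that layout follows; it is not the answer's dplyr method. The helper `first_cross` and the column names `doy.start` and `day.thres` are made up for illustration. It assumes the sample data from the question (rebuilt here without `cbind`, so `doy` and `year` stay integer).

```r
# Sample data as in the question
doy <- rep(1:365, times = 2)
year <- rep(2000:2001, each = 365)
set.seed(1)
value <- runif(365 * 2, min = 0, max = 10)
df <- data.frame(doy, year, value)
doy.range <- c(40, 50, 60, 80)
thres <- 200

# Hypothetical helper: first doy on or after `start` at which the running
# total of `value` within year `yr` reaches `thres` (NA if it never does)
first_cross <- function(yr, start, data, thres) {
  d <- data[data$year == yr & data$doy >= start, ]
  d <- d[order(d$doy), ]                      # make sure days are in order
  d$doy[which(cumsum(d$value) >= thres)[1]]
}

# One row per (start day, year) combination
res <- expand.grid(doy.start = doy.range, year = unique(df$year))
res$day.thres <- mapply(first_cross, res$year, res$doy.start,
                        MoreArgs = list(data = df, thres = thres))
res
```

With the same data, the `day.thres` column should agree with the `doy` values in the lapply output above, but in a single data frame. Using `thres` as an argument also avoids hard-coding 200 in the function.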