
How to speed up an R loop with sequential operations


I have a model with several conditions that returns a value which the next prediction depends on. Say that, given time series A and B, the model returns a value of a variable C, which in turn is used to estimate a value of D. In the next iteration, along with the new A and B, the model also uses the estimated D as input:

df = data.frame(A = sample(-5:5, 10000, replace = TRUE),
                B = sample(-5:5, 10000, replace = TRUE),
                C = 0,
                D = 0)

for (i in seq_len(nrow(df) - 1)) {  # stop one row early so D[i + 1] stays in range

  if (df$A[i] < 0 & df$B[i] > 0) {
    df$C[i] <- df$B[i]

  } else if (df$A[i] == 0 & df$B[i] == 0) {
    df$C[i] <- 0

  } else {
    df$C[i] <- df$A[i] + df$B[i] - df$D[i]
  }

  # cumulative sum that resets to 0
  df$D[i + 1] <- ifelse(df$D[i] <= -df$C[i], 0, df$D[i] + df$C[i])

}

Though the code works, it is very slow because I have hundreds of thousands of observations. I would appreciate any suggestions that could speed it up.


Solution

  • Since each row depends on the result of the previous row, this is difficult to write in a way that takes advantage of R's vectorization. In cases like this, writing the code in Rcpp gives a massive advantage.

    library(Rcpp)
    
    cppFunction('
    
    DataFrame f_Rcpp(DataFrame df) {
    
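      // A and B are integer columns here, so these are converted copies.
      NumericVector A = df["A"];
      NumericVector B = df["B"];
      // C and D are already double, so these vectors share memory with df:
      // the writes below change the columns of df itself.
      NumericVector C = df["C"];
      NumericVector D = df["D"];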
    
      for(int i = 0; i < (df.nrows() - 1); ++i) {
        
        if (A[i] < 0 && B[i] > 0) {     
          C[i] = B[i];
          
        } else if(A[i] == 0 && B[i] == 0 ) { 
          C[i] = 0;
          
        }  else {
          C[i] = A[i] + B[i] - D[i];
        }
        
        if(D[i] <= -C[i]) {
          D[i+1] = 0;
        } else {
          D[i+1] = D[i] + C[i];
        }
      }
      return(df);
    }
                
    ')
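The recurrence itself does not need any Rcpp types. As a minimal sketch, the same loop can be written as standalone C++ over `std::vector<double>`, which makes the logic easy to compile and check outside R before wiring it into `cppFunction`. The names `CumsumResetResult` and `cumsum_reset` are hypothetical and not part of Rcpp or of the answer above; the branching and the early loop stop mirror the Rcpp function.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the two computed columns.
struct CumsumResetResult {
    std::vector<double> C;
    std::vector<double> D;
};

// Plain C++ version of the loop: C[i] depends on D[i], and D[i + 1]
// depends on C[i], so the rows must be processed in order.
// As in the Rcpp function, the loop stops one row early so that
// D[i + 1] stays in range; C for the last row is therefore left at 0.
CumsumResetResult cumsum_reset(const std::vector<double>& A,
                               const std::vector<double>& B) {
    // Assumption: A and B have the same length; use the shorter one if not.
    const std::size_t n = A.size() < B.size() ? A.size() : B.size();
    CumsumResetResult out{std::vector<double>(n, 0.0),
                          std::vector<double>(n, 0.0)};

    for (std::size_t i = 0; i + 1 < n; ++i) {
        if (A[i] < 0 && B[i] > 0) {
            out.C[i] = B[i];
        } else if (A[i] == 0 && B[i] == 0) {
            out.C[i] = 0.0;
        } else {
            out.C[i] = A[i] + B[i] - out.D[i];
        }
        // Cumulative sum that resets to 0 when D[i] <= -C[i].
        out.D[i + 1] = (out.D[i] <= -out.C[i]) ? 0.0 : out.D[i] + out.C[i];
    }
    return out;
}
```

Because this version builds fresh output vectors, it never modifies its inputs. An Rcpp wrapper around it would still have to copy the results back into the data frame.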
    

    If we wrap your own code as a function so we can compare it, we see that our Rcpp function gives the same results:

    f_R <- function(df) {
      for(i in 1:(nrow(df) - 1)) {
        
        if (df$A[i] < 0 & df$B[i] > 0) {     
          df$C[i] <- df$B[i]
          
        } else if(df$A[i] == 0 & df$B[i] == 0 ){ 
          df$C[i] <- 0
          
        }  else {
          df$C[i] <- df$A[i] + df$B[i] - df$D[i]  
        }
        
        df$D[i+1] <- ifelse(df$D[i] <= -df$C[i], 0, df$D[i] + df$C[i]) 
        
      }
      return(df)
    }
    
    res1 <- f_R(df)
    res2 <- f_Rcpp(df)
    
    identical(res1, res2)
    #> [1] TRUE
    

    But look what happens when we benchmark:

    microbenchmark::microbenchmark(f_R(df), f_Rcpp(df), times = 10)
    #> Unit: microseconds
    #>       expr         min        lq         mean      median          uq         max neval cld
    #>    f_R(df) 1746032.401 1793779.0 1794274.9209 1802222.051 1810686.801 1815285.001    10   b
    #> f_Rcpp(df)     567.701     585.9     610.1607     601.851     642.801     650.101    10  a 
    

    The Rcpp function processes all 10,000 rows in less than a millisecond (median about 0.6 ms), as opposed to about 1.8 seconds in base R. The Rcpp version is almost 3,000 times faster.


    Edit

    To get this working with your own data, try:

    cppFunction('
    
    DataFrame f_Rcpp(DataFrame df, NumericVector v) {
      NumericVector A = df["Tav"];
      NumericVector B = df["dprcp"];
      NumericVector C = df["dSWE"];
      NumericVector D = df["simSWE"];
      NumericVector E = df["dSWElag"];
    
      for(int i = 5; i < (df.nrows() - 1); ++i) {
        if (A[i] < -1 && B[i] > 0) {     
          C[i] = B[i];
        } else if(A[i] < -1 && B[i] == 0 ) { 
          C[i] = 0;
        }  else {
          C[i] = v[i];
        }
        
        if(D[i-1] <= -C[i]) {
          D[i] = 0;
        } else {
          D[i] = D[i-1] + C[i]; 
        }
        E[i + 1] = C[i];
      }
    
      df["dSWE"] = C;
      df["simSWE"] = D;
      df["dSWElag"] = E;
      
      return(df);
    }        
    ')
    

    You could call it like this:

    preds <- predict(svm_model,station)
    
    station2 <- f_Rcpp(station, preds)
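
The loop in this second function indexes the prediction vector `v` by row, so `v` needs an entry for every row it touches. The following is a rough standalone C++ sketch of the same update with an explicit length check up front. The names `SweColumns` and `update_swe` are hypothetical. The sketch assumes that the predictions line up one-to-one with the rows of the table, and it takes the starting row as a parameter (5 in the function above).

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical holder for the three columns that the loop writes to.
struct SweColumns {
    std::vector<double> dSWE;     // C in the answer
    std::vector<double> simSWE;   // D in the answer
    std::vector<double> dSWElag;  // E in the answer
};

// Same update rule as the second Rcpp function, on plain vectors.
// Assumption: v holds one prediction per row, aligned with Tav and dprcp.
void update_swe(const std::vector<double>& Tav,
                const std::vector<double>& dprcp,
                const std::vector<double>& v,
                SweColumns& cols,
                std::size_t start = 5) {
    const std::size_t n = Tav.size();
    if (dprcp.size() != n || v.size() < n || cols.dSWE.size() != n ||
        cols.simSWE.size() != n || cols.dSWElag.size() != n) {
        throw std::invalid_argument("all inputs must cover every row");
    }
    if (start == 0) {
        throw std::invalid_argument("start must be >= 1 (row i reads row i - 1)");
    }

    for (std::size_t i = start; i + 1 < n; ++i) {
        if (Tav[i] < -1 && dprcp[i] > 0) {
            cols.dSWE[i] = dprcp[i];
        } else if (Tav[i] < -1 && dprcp[i] == 0) {
            cols.dSWE[i] = 0.0;
        } else {
            cols.dSWE[i] = v[i];  // fall back to the model prediction
        }
        // Row i of simSWE is built from row i - 1 (cumulative sum with reset).
        cols.simSWE[i] = (cols.simSWE[i - 1] <= -cols.dSWE[i])
                             ? 0.0
                             : cols.simSWE[i - 1] + cols.dSWE[i];
        cols.dSWElag[i + 1] = cols.dSWE[i];  // lagged copy for the next row
    }
}
```

Checking the lengths before the loop turns a mismatch (for example, if the prediction step returned fewer values than there are rows) into an error rather than an out-of-range read.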