Search code examples
rperformanceloopsindices

How to speed up this calculation using 3 for loops with a complex progress of indices?


Given the following data frame df and a numeric vector p containing one value:

df <- data.frame(id = c(rep(1, 110), rep(2, 290)),
                 m  = c(seq(1, 110), seq(1:290)),
                 m1 = c(rep(108, 110), rep(288, 290)),
                 m2 = c(rep(3, 400)),
                 f1 = c(rep(-100, 110), rep(-50, 290)),
                 f2 = c(rep(22, 110), rep(15, 290)),
                 f3 = c(rep(5, 110), rep(0, 290)),
                 u  = c(c(0.12, 0.16, 0.10), rep(0, 107), c(0.085, 0.09, 0.11), rep(0, 287)),
                 v  = c(rep(0.175, 3), rep(0, 107), rep(0.115, 3), rep(0, 287)),
                 y  = rep(0, 400))

df$s <- sqrt(df$m/(df$m1 + df$m2 - 1))/40

p <- 0.01

Here's a snippet:

> head(df)
  id m  m1 m2   f1 f2 f3    u     v y           s
1  1 1 108  3 -100 22  5 0.12 0.175 0 0.002383656
2  1 2 108  3 -100 22  5 0.16 0.175 0 0.003370999
3  1 3 108  3 -100 22  5 0.10 0.175 0 0.004128614
4  1 4 108  3 -100 22  5 0.00 0.000 0 0.004767313
5  1 5 108  3 -100 22  5 0.00 0.000 0 0.005330018
6  1 6 108  3 -100 22  5 0.00 0.000 0 0.005838742

Here are some facts about the data:

  1. Variables id and m uniquely identify each row (primary key).
  2. Variable m means 'month'. The dataset is therefore a time series.
  3. Variables f1, f2, f3, m1 and m2 are constant for each value of id. These don't depend on variable m.
  4. Variables s, u and v are not constant for each value of id and therefore do depend on m.
  5. The number of rows for each value of id equals m1 + m2 - 1. Or equivalent: the maximum value of m for each value of id equals m1 + m2 - 1.

The goal is to calculate y using the formula below:

I have created a solution that does just that:

counter <- 0
start   <- proc.time()

for(n in 1:nrow(df)){

  #index k holds the current value for m
  k <- df$m[n]
  counter <- counter + 1

  #read the current value for m1 and m2
  m1 <- df$m1[n]
  m2 <- df$m2[n]
  counter <- counter + 2

  #calculate the sum of f1, f2 and f3.
  sum_of_fs <- df$f1[n] + df$f2[n] + df$f3[n]
  counter <- counter + 1

  #initialize y. Set it to zero.
  y <- 0
  counter <- counter + 1

  for(i in k:min(m1 + k - 1, m1 + m2 - 1)){

    #Initialize the sumproduct of u and v. Set it to zero.
    sumprod_uv <- 0
    counter <- counter + 1

    for(j in min(k, m2):max(1, i - m1 + 1)){

      sumprod_uv <- sumprod_uv + df$u[j] + df$v[i - j + 1]
      counter <- counter + 1

    }  

    z <- ((1 + p)/(1 + df$s[i]))^(i / 12)
    y <- y + sumprod_uv * z
    counter <- counter + 2  
  }  

  y <- y * sum_of_fs
  df$y[n] <- y
  counter <- counter + 2
}

counter

proc.time() - start

In this piece of code I included 2 extra things:

  1. A counter named counter that counts the number of statements that are executed.
  2. A timer that measures the duration of the script.

Now the complication is that the script takes too long to run. For this toy example it took about 2 seconds (with the counter statements commented out), which is acceptable:

   user  system elapsed 
  1.829   0.002   1.872 

The number of statements this duration corresponds with is 290,188 (value of counter when script is done running)

In real life I have a dataset that contains more than 90k records. Besides that, the real dataset is slightly more complicated (7 variables that make up the id instead of one). I ran the script using that dataset and it ran for about 17 minutes.

The question is: how can I speed up this algorithm? There should be a neater way to do this.


Solution

  • Her you have a C++ variant which could be faster than in R.

    library(Rcpp)
    sourceCpp(code = "#include <Rcpp.h>
    #include <vector>
    #include <algorithm>
    
    using namespace Rcpp;
    
    // [[Rcpp::export]]
    std::vector<double> fun(double &p
    , std::vector<int> &dfm
    , std::vector<int> &dfm1
    , std::vector<int> &dfm2
    , std::vector<double> &u
    , std::vector<double> &v
    , std::vector<double> &s
    ) {
    std::vector<double> yy(s.size());
    for(size_t n=0; n<s.size(); ++n) {
      int k = dfm[n];
      int m1 = dfm1[n];
      int m2 = dfm2[n];
      int v1 = std::min(k, m2);
      double y = 0.;
      int ii = std::min(m1 + k - 1, m1 + m2 - 1);
      for(int i=std::min(k,ii); i<=std::max(k,ii); ++i) {
        double sumprod_uv = 0.;
        int jj = std::max(1, i - m1 + 1);
        for (int j=std::min(v1, jj); j<=std::max(v1, jj); ++j) {
          sumprod_uv += u[j-1] + v[i - j];
        }  
        y += sumprod_uv * std::pow(((1. + p)/(1. + s[i-1])), (i / 12.));
      }
      yy[n] = y;
    }
    return yy;
    }")
    system.time(df$y <- fun(p, df$m, df$m1, df$m2, df$u, df$v, df$s))
    #   user  system elapsed 
    #  0.005   0.000   0.004 
    

    After the question update including f1, f2, and f3:

    df$y <- fun(p, df$m, df$m1, df$m2, df$u, df$v, df$s) * (df$f1 + df$f2 + df$f3)
    

    For time comparison the timings on my pc:

    #Your code
    #   user  system elapsed 
    #  0.358   0.004   0.362 
    
    #@minem
    #  user  system elapsed 
    #  0.090   0.003   0.093