Given the following data frame df
and a numeric vector p
containing one value:
df <- data.frame(id = c(rep(1, 110), rep(2, 290)),
m = c(seq(1, 110), seq(1:290)),
m1 = c(rep(108, 110), rep(288, 290)),
m2 = c(rep(3, 400)),
f1 = c(rep(-100, 110), rep(-50, 290)),
f2 = c(rep(22, 110), rep(15, 290)),
f3 = c(rep(5, 110), rep(0, 290)),
u = c(c(0.12, 0.16, 0.10), rep(0, 107), c(0.085, 0.09, 0.11), rep(0, 287)),
v = c(rep(0.175, 3), rep(0, 107), rep(0.115, 3), rep(0, 287)),
y = rep(0, 400))
df$s <- sqrt(df$m/(df$m1 + df$m2 - 1))/40
p <- 0.01
Here's a snippet:
> head(df)
id m m1 m2 f1 f2 f3 u v y s
1 1 1 108 3 -100 22 5 0.12 0.175 0 0.002383656
2 1 2 108 3 -100 22 5 0.16 0.175 0 0.003370999
3 1 3 108 3 -100 22 5 0.10 0.175 0 0.004128614
4 1 4 108 3 -100 22 5 0.00 0.000 0 0.004767313
5 1 5 108 3 -100 22 5 0.00 0.000 0 0.005330018
6 1 6 108 3 -100 22 5 0.00 0.000 0 0.005838742
Here are some facts about the data:
id
and m
uniquely identify each row (primary key).m
means 'month'. The dataset is therefore a time series. f1
, f2
, f3
, m1
and m2
are constant for each value of id
. These don't depend on variable m
.s
, u
and v
are not constant for each value of id
and therefore do depend on m
.id
equals m1 + m2 - 1. Or equivalent: the maximum value of m
for each value of id
equals m1 + m2 - 1.The goal is to calculate y
using the formula below:
I have created a solution that does just that:
counter <- 0
start <- proc.time()
for(n in 1:nrow(df)){
#index k holds the current value for m
k <- df$m[n]
counter <- counter + 1
#read the current value for m1 and m2
m1 <- df$m1[n]
m2 <- df$m2[n]
counter <- counter + 2
#calculate the sum of f1, f2 and f3.
sum_of_fs <- df$f1[n] + df$f2[n] + df$f3[n]
counter <- counter + 1
#initialize y. Set it to zero.
y <- 0
counter <- counter + 1
for(i in k:min(m1 + k - 1, m1 + m2 - 1)){
#Initialize the sumproduct of u and v. Set it to zero.
sumprod_uv <- 0
counter <- counter + 1
for(j in min(k, m2):max(1, i - m1 + 1)){
sumprod_uv <- sumprod_uv + df$u[j] + df$v[i - j + 1]
counter <- counter + 1
}
z <- ((1 + p)/(1 + df$s[i]))^(i / 12)
y <- y + sumprod_uv * z
counter <- counter + 2
}
y <- y * sum_of_fs
df$y[n] <- y
counter <- counter + 2
}
counter
proc.time() - start
In this piece of code I included 2 extra things:
counter
that counts the number of statements that are executed.Now the complication is that the script takes too long to run. For this toy example it took about 2 seconds (with the counter statements commented out), which is acceptable:
user system elapsed
1.829 0.002 1.872
The number of statements this duration corresponds with is 290,188 (value of counter
when script is done running)
In real life I have a dataset that contains more than 90k records. Besides that, the real dataset is slightly more complicated (7 variables that make up the id instead of one). I ran the script using that dataset and it ran for about 17 minutes.
The question is: how can I speed up this algorithm? There should be a neater way to do this.
Her you have a C++ variant which could be faster than in R.
library(Rcpp)
sourceCpp(code = "#include <Rcpp.h>
#include <vector>
#include <algorithm>
using namespace Rcpp;
// [[Rcpp::export]]
std::vector<double> fun(double &p
, std::vector<int> &dfm
, std::vector<int> &dfm1
, std::vector<int> &dfm2
, std::vector<double> &u
, std::vector<double> &v
, std::vector<double> &s
) {
std::vector<double> yy(s.size());
for(size_t n=0; n<s.size(); ++n) {
int k = dfm[n];
int m1 = dfm1[n];
int m2 = dfm2[n];
int v1 = std::min(k, m2);
double y = 0.;
int ii = std::min(m1 + k - 1, m1 + m2 - 1);
for(int i=std::min(k,ii); i<=std::max(k,ii); ++i) {
double sumprod_uv = 0.;
int jj = std::max(1, i - m1 + 1);
for (int j=std::min(v1, jj); j<=std::max(v1, jj); ++j) {
sumprod_uv += u[j-1] + v[i - j];
}
y += sumprod_uv * std::pow(((1. + p)/(1. + s[i-1])), (i / 12.));
}
yy[n] = y;
}
return yy;
}")
system.time(df$y <- fun(p, df$m, df$m1, df$m2, df$u, df$v, df$s))
# user system elapsed
# 0.005 0.000 0.004
After the question update including f1, f2, and f3:
df$y <- fun(p, df$m, df$m1, df$m2, df$u, df$v, df$s) * (df$f1 + df$f2 + df$f3)
For time comparison the timings on my pc:
#Your code
# user system elapsed
# 0.358 0.004 0.362
#@minem
# user system elapsed
# 0.090 0.003 0.093