I would like to calculate the bivariate empirical cumulative density function for two very large vectors (over 250 million elements) to calculate the percentage for each pair of values i:n with a for-loop and store it in a result vector. Due to the length of the two vectors it is already obvious that the calculation time would be extremely long, so I would like to translate my for-loop into rcpp.
# minimal working example
vec_a <- runif(1e+4)
vec_b <- rnorm(1e+4)
total <- length(vec_b)
store <- vector()
for(i in 1:total){store[i] <- sum(vec_a <= vec_a[i] & vec_b <= vec_b[i])/total}
I tried to translate my loop, but since I just started to work with rcpp, some things are not quite clear to me. I would be happy if someone could give me an answer a.) why the results are not identical and b.) if it would be possible to speed up the rcpp code.
# Rcpp protoype
library(Rcpp)
cppFunction(
"NumericVector FasterLoop(NumericVector x, NumericVector y) {
const int n = x.length();
NumericVector z(n);
for (int i=0; i < n; ++i) {
z[i] = sum(x <= x[i] & y <= y[i])/n;
}
return z;
}")
proto <- FasterLoop(vec_a, vec_b)
The problem is that sum(x <= x[i] & y <= y[i])
returns an integer, and then sum(x <= x[i] & y <= y[i])/n
performs an integer division. You have to cast sum(x <= x[i] & y <= y[i])
to a double
. This is automatically done by doing z[i] = sum(x <= x[i] & y <= y[i])
and then by dividing z[i]
by n
.
library(Rcpp)
cppFunction(
"NumericVector FasterLoop(NumericVector x, NumericVector y) {
const int n = x.length();
NumericVector z(n);
for (int i=0; i < n; ++i) {
z[i] = sum(x <= x[i] & y <= y[i]);
}
return z/n;
}")
FasterLoop(c(1,2,3), c(1,2,3))