Search code examples
rfor-looprcpp

Reduce computation time of bivariate ecdf with large vectors in R


I would like to calculate the bivariate empirical cumulative density function for two very large vectors (over 250 million elements) to calculate the percentage for each pair of values i:n with a for-loop and store it in a result vector. Due to the length of the two vectors it is already obvious that the calculation time would be extremely long, so I would like to translate my for-loop into rcpp.

# minimal working example

vec_a <- runif(1e+4)
vec_b <- rnorm(1e+4)
total <- length(vec_b)
store <- vector()

for(i in 1:total){store[i] <- sum(vec_a <= vec_a[i] & vec_b <= vec_b[i])/total}

I tried to translate my loop, but since I just started to work with rcpp, some things are not quite clear to me. I would be happy if someone could give me an answer a.) why the results are not identical and b.) if it would be possible to speed up the rcpp code.

# Rcpp protoype
library(Rcpp)
cppFunction(
  "NumericVector FasterLoop(NumericVector x, NumericVector y) {
  const int n = x.length();
  NumericVector z(n);
  for (int i=0; i < n; ++i) {
   z[i] = sum(x <= x[i] & y <= y[i])/n;
  }
  return z;
}")

proto <- FasterLoop(vec_a, vec_b)

Solution

  • The problem is that sum(x <= x[i] & y <= y[i]) returns an integer, and then sum(x <= x[i] & y <= y[i])/n performs an integer division. You have to cast sum(x <= x[i] & y <= y[i]) to a double. This is automatically done by doing z[i] = sum(x <= x[i] & y <= y[i]) and then by dividing z[i] by n.

    library(Rcpp)
    cppFunction(
      "NumericVector FasterLoop(NumericVector x, NumericVector y) {
      const int n = x.length();
      NumericVector z(n);
      for (int i=0; i < n; ++i) {
       z[i] = sum(x <= x[i] & y <= y[i]);
      }
      return z/n;
    }")
    
    FasterLoop(c(1,2,3), c(1,2,3))