Search code examples
rdplyrdata.tablercppdtplyr

R: Efficient iterative subsetting and filtering of large vector


I'd like to perform the following operation more quickly.

Logic: I have a vector big of 4 elements 1, 2, 3, 4. I also have a same-length vector of thresholds 1.1, 3.1, 4.1, 5.1. I want for each element to find the index of the first next element to be above the corresponding threshold. In this case my expected output is

2, 3, NA, NA:

  • the first element after the first one (included) which is above the threshold of 1.1 is at index 2 (value of 2).
  • The first element above the second threshold of 3.1 is of value 4, and is the third element after the current one at index 2 (included).

Base implementation

start <- Sys.time()
bigg <- rnorm(25000)
thresh <- bigg+0.5
result <- rep(NA, length(bigg))
for(i in 1:length(bigg)) {
  result[i] <- which(bigg[(i+1):length(bigg)]>thresh[i])[1] # the first next element that is higher than thresh
  if(i%%1000==0) print(paste0(i, " ", round(i/length(bigg),3)))
}
end <- Sys.time()
end-start
head(result)

Basically, taking the first element of the vector x after the current one that satisfies a threshold condition.

I tried using Rcpp

// [[Rcpp::export]]
int cppnextup_(NumericVector x, double thresh, bool is_up = true) {
  int n = x.size();
  //int idx = 0;
  int res = -1;
  for(int idx = 0; idx < n; ++idx) {
    if(x[idx]>thresh && is_up == true) {
      res = idx;
//Rcout << "The value of idx : " << idx <<" "<< x[idx]<<"\n";
      break;
    }
    if(x[idx]<thresh && is_up == false) {
      res = idx;
      //Rcout << "The value of idx : " << idx <<" "<< x[idx]<<"\n";
      break;
    }
  }
  return res;
}

Benchmarking:

# base --------------------------------------------------------------------

base_ <- function() {
  for(i in 1:length(bigg)) {
    result[i] <- which(bigg[(i+1):length(bigg)]>thresh[i])[1] # the first next element that is higher than thresh
    if(i%%1000==0) print(paste0(i, " ", round(i/length(bigg),3)))
  }
}

# cpp ----------------------------------------------------------------

result_cpp <- rep(NA, length(bigg))
cpp_ <- function() {
  for(i in 1:length(bigg)) {
    result_cpp[i] <- cppnextup_(bigg[(i+1):length(bigg)], thresh[i]) # the first next element that is higher than thresh
    if(i%%1000==0) print(paste0(i, " ", round(i/length(bigg),3)))
  }
}

#result_cpp <- ifelse(result_cpp==-1, NA, result_cpp)
#result_cpp <- result_cpp+1
#all.equal(result, result_cpp)
#[1] TRUE

# benchmark ---------------------------------------------------------------

microbenchmark::microbenchmark(base_(),
                               cpp_(), times=3)
Unit: milliseconds
    expr      min        lq      mean    median        uq       max neval
 base_() 2023.510 2030.3154 2078.7867 2037.1211 2106.4252 2175.7293     3
  cpp_()  661.277  665.3456  718.8851  669.4141  747.6891  825.9641     3

My Rcpp implementation reduces base time by 65%, is there a better (vectorized) way? Looking for any backend, be it Rcpp, data.table, dtplyr etc.

My dtplyr attempt yields all NA's:

library(dtplyr)
nx <- length(bigg)
df <- tibble(bigg, thresh)
bigg %>% lazy_dt() %>% mutate(res = which(bigg[row_number():nx]>thresh)[1])
Warning message:
In seq_len(.N):..nx :
  numerical expression has 25000 elements: only the first used

Cheers

Btw, my real vector has 8,406,600 elements.


EDIT: vectorized Rcpp

I also have another, faster Rcpp function which relies on the first one:

// [[Rcpp::export]]
NumericVector cppnextup(NumericVector x, double threshup, bool is_up = true) {
  int n = x.size();
  NumericVector up(n);
  if(is_up == true) {
    up = x + threshup;
  } else {
    up = x - threshup;
  }
//  Rcout << "The value of up : " << up[0] <<" "<< up[1] <<"\n";
  NumericVector result(n);
  int idx = 0;
  for(int i = 0; i < n; ++i) {
    double thisup = up[idx];
    NumericVector thisvect = x[Rcpp::Range((idx), (n-1))];
    
//Rcout <<idx<< " " << "thisvect : " << thisvect[0] <<" thisup: "<< thisup <<" buy " << buy << "\n";
    
    int resi = cppnextup_(thisvect, thisup, is_up = is_up);
    if(resi != 0) {
      result[idx] = resi+1;
    } else {
      result[idx] = resi;
    }
    
    //Rcout << "RESI: " << resi <<" "<< up[1] <<"\n";
    idx = idx + 1;
  }
  return result;
}

As you can see it is faster than the previous two:

# cpp_vectorized ----------------------------------------------------------

cpp_vect <- function(bigg) {
  res_cppvect <- cppnextup(bigg, 0.5)
}

    # benchmark ---------------------------------------------------------------
    
    microbenchmark::microbenchmark(base_(),
                                   cpp_(), 
                                   cpp_vect(),
                                   times=3)
           expr       min        lq      mean    median        uq       max neval
        base_() 2014.7211 2016.8679 2068.9869 2019.0146 2096.1198 2173.2250     3
         cpp_()  663.0874  666.1540  718.5863  669.2207  746.3357  823.4507     3
     cpp_vect()  214.1745  221.2103  223.9532  228.2460  228.8426  229.4392     3

BUT when I pass a larger vector in argument, it freezes and never returns a result.

res <- cpp_vect(bigg=rnorm(1000000)) # freezes

Any help welcome.


Solution

  • A data.table non-equi join with mult = "first" works well. It won't be as fast as an optimized Rcpp function, though.

    library(data.table)
    
    bigg <- rnorm(25000)
    thresh <- bigg+0.5
    f1 <- function(bigg, thresh) {
      result <- rep(NA, length(bigg))
      for(i in 1:length(bigg)) {
        result[i] <- which(bigg[(i+1):length(bigg)]>thresh[i])[1] # the first next element that is higher than thresh
      }
      result
    }
    
    f2 <- function(bigg, thresh) {
      data.table(
        val = bigg,
        r = seq_along(bigg)
      )[
        data.table(
          val = thresh,
          r = seq_along(thresh)
        ),
        on = .(val > val, r > r),
        .(result = x.r - i.r),
        mult = "first"
      ]$result
    }
    
    microbenchmark::microbenchmark(f1 = f1(bigg, thresh),
                                   f2 = f2(bigg, thresh),
                                   times = 10,
                                   check = "identical")
    #> Unit: milliseconds
    #>  expr      min       lq      mean    median       uq       max neval
    #>    f1 2167.139 2199.801 2217.6945 2222.4937 2233.254 2250.1693    10
    #>    f2  605.999  610.576  612.0431  611.1439  614.195  618.6248    10
    
    bigg <- rnorm(1e6)
    thresh <- bigg+0.5
    system.time(f2(bigg, thresh))
    #>    user  system elapsed 
    #>  375.71    0.15  375.81