r  data.table  vectorization  sequential

Sequentially update rows in a data.table


I have a very big dataset and I would like to perform the following computation in R using data.table:

library(data.table)

# This is a test dataset
tempData <- data.table(
  drugName = rep("Aspirine", times = 4),
  startdt = c("2012-01-01",
              "2012-01-20",
              "2012-02-15",
              "2012-03-10"),
  daysupp = c(30,30,10,20))


# An example of the desired computation
tempData[, startdt:= as.Date(startdt)]
tempData[1, enddt:= startdt + daysupp]

for (i in 2:nrow(tempData)) {

  if (tempData[i,startdt] >= tempData[i-1,enddt]) {
    tempData[i, enddt:= startdt + daysupp]

  } else {
    tempData[i, enddt:= tempData[i-1,enddt] + daysupp]
  }

}

This computation has to be done separately for each drug name, so I could wrap the for loop in a function and apply it to my data.table grouped by drug name. However, the loop takes a lot of time. Is there a way to update the data.table rows sequentially using a vectorized approach?

I was thinking of using shift, but I cannot find a way to update the enddt variable sequentially while following the two branches of the if statement.

This is really a general question about how to perform this type of computation quickly.


Solution

  • I'd write a simple Rcpp function instead of spending time trying to find a vectorized R solution:

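    library(Rcpp)
    sourceCpp(code = "
              #include <Rcpp.h>
              // [[Rcpp::export]]
              Rcpp::IntegerVector myfun(const Rcpp::IntegerVector x, const Rcpp::IntegerVector y) {
              // x: start dates as day counts, y: days supplied
              // clone() so that an integer input is never modified in place
              Rcpp::IntegerVector res = Rcpp::clone(x);
              if (x.length() == 0) return res;
              res(0) = x(0) + y(0);
              for (int i=1; i<x.length(); i++) {
                // no overlap: count from this row's own start date
                if (x(i) >= res(i-1)) res(i) += y(i);
                // overlap: append this supply to the previous end date
                else res(i) = res(i-1) + y(i);
              }
              return res;
              }
              ")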
    tempData[, enddt1 := myfun(startdt, daysupp)]
    #   drugName    startdt daysupp      enddt     enddt1
    #1: Aspirine 2012-01-01      30 2012-01-31 2012-01-31
    #2: Aspirine 2012-01-20      30 2012-03-01 2012-03-01
    #3: Aspirine 2012-02-15      10 2012-03-11 2012-03-11
    #4: Aspirine 2012-03-10      20 2012-03-31 2012-03-31
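
That said, this particular recurrence does seem to have a vectorized form. Both branches of the loop amount to enddt[i] = max(startdt[i], enddt[i-1]) + daysupp[i], and subtracting cumsum(daysupp) from both sides turns that into a running maximum. Here is a minimal base R sketch of the idea. The helper name cascade_end is my own invention, and I am assuming startdt is a Date vector and that neither input contains missing values.

```r
# Hypothetical helper (not part of the Rcpp answer above): vectorized
# form of enddt[i] = max(startdt[i], enddt[i-1]) + daysupp[i].
# Assumes startdt is a Date vector and neither argument contains NA.
cascade_end <- function(startdt, daysupp) {
  supplied <- cumsum(daysupp)      # days supplied up to and including each row
  before   <- supplied - daysupp   # days supplied before each row
  # Running maximum of each start date shifted back by the earlier supply
  anchor   <- cummax(as.numeric(startdt) - before)
  as.Date(anchor + supplied, origin = "1970-01-01")
}

startdt <- as.Date(c("2012-01-01", "2012-01-20", "2012-02-15", "2012-03-10"))
daysupp <- c(30, 30, 10, 20)
cascade_end(startdt, daysupp)
# [1] "2012-01-31" "2012-03-01" "2012-03-11" "2012-03-31"
```

On the test data this should be usable per drug with something like tempData[, enddt2 := cascade_end(startdt, daysupp), by = drugName]. The Rcpp loop remains the more general tool, since most sequential updates do not reduce to cumsum and cummax this neatly.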