Search code examples
rdata.tablesapply

data.table - apply vector of values


I'm somewhat bogged down by this question. I have a data table of beta distribution parameters, each row in the data table corresponding to a relative probability of that distribution to represent the actual outcome.

I want to compute the cumulative distribution function for a number of sample values. Using sapply, the code looks like this:

beta_dists <- data.table(data.frame(probs = c(0.4,0.3,0.3), a = c(0.0011952,0.001,0.00809), b = c(837,220,624), scale = c(1.5e9,115e6,1.5e6)))
xx <- seq(0,1.5e9,length = 2^12)

system.time(FX <- sapply(xx, function(x) (beta_dists[x < scale,.(FX = sum(probs * (1 - pbeta(x / scale, a, b))))])$FX))

However, that's quite slow and does not seem very elegant... Any thoughts on how to make this better?


Solution

  • Here is a suggestion to use a non-equi join by converting your xx into a data.table to be used in i:

    ans <- beta_dists[dtx, on=.(scale > x), allow.cartesian=TRUE,
        sum(probs * (1 - pbeta(x / x.scale, a, b))), by=.EACHI]$V1
    

    check:

    #last element is NA in ans whereas its NULL in FX
    identical(unlist(FX), head(ans$V1, -1))
    #[1] TRUE
    

    timing code:

    opmtd <- function() {
        sapply(xx, function(x) (beta_dists[x < scale,.(FX = sum(probs * (1 - pbeta(x / scale, a, b))))])$FX)
    }
    
    nonequiMtd <- function() {
        beta_dists[dtx, on=.(scale > x), allow.cartesian=TRUE, sum(probs * (1 - pbeta(x / x.scale, a, b))), by=.EACHI]   
    }
    
    vapplyMtd <- function() {
        dt[, res := vapply(x, f, 0)]
    }
    
    library(microbenchmark)
    microbenchmark(opmtd(), nonequiMtd(), vapplyMtd(), times=3L)
    

    timings:

    Unit: milliseconds
             expr        min         lq       mean     median         uq        max neval
          opmtd() 2589.67889 2606.77795 2643.77975 2623.87700 2670.83018 2717.78336     3
     nonequiMtd()   19.59376   21.12739   22.28428   22.66102   23.62954   24.59805     3
      vapplyMtd() 1928.25841 1939.91866 1960.31181 1951.57891 1976.33852 2001.09812     3
    

    data:

    library(data.table)
    beta_dists <- data.table(probs = c(0.4,0.3,0.3), a = c(0.0011952,0.001,0.00809), b = c(837,220,624), scale = c(1.5e9,115e6,1.5e6))
    xx <- seq(0, 1.5e9, length = 2^12)
    dtx <- data.table(x=xx)