Search code examples
rperformancedata.tablerowwise

Preferred performant procedure for R data.table row-wise operations?


Does the following code represent the preferred procedure for traversing the rows of an R data.table and passing the values found at each row to a function? Or is there a more performant way to do this?

library(data.table)
set.seed(2)
n <- 100
b <- c(0.5, 1.5, -1)
phi <- 0.8
X <- cbind(1, matrix(rnorm(n*2, 0, 1), ncol = 2))
y <- X %*% matrix(b, ncol = 1) + rnorm(n, 0, phi)
d <- data.table(y, X)
setnames(d, c("y", "x0", "x1", "x2"))

logpost <- function(d, b1, b2, b3, phi, mub = 1, taub = 10, a = 0.5, z = 0.7){
    N <- nrow(d)
    mu <- b1 + b2 * d$x1 + b3 * d$x2
    lp <- -N * log(phi) -
        (1/(2*phi^2)) * sum( (d$y-mu)^2  ) -
        (1/(2*taub^2))*( (b1-mub)^2 + (b2-mub)^2 + (b3-mub)^2 ) -
        (a+1)*log(phi) - (z/phi)
    lp
}

nn <- 21
grid <- data.table(
expand.grid(b1 = seq(0, 1, len = nn),
    b2 = seq(1, 2, len = nn),
    b3 = seq(-1.5, -0.5, len = nn),
    phi = seq(0.4, 1.2, len = nn)))
grid[, id := 1:.N]
setkey(grid, id)

wraplogpost <- function(dd){
    logpost(d, dd$b1, dd$b2, dd$b3, dd$phi)
}
start <- Sys.time()
grid[, lp := wraplogpost(.SD), by = seq_len(nrow(grid))]
difftime(Sys.time(), start)
# Time difference of 2.081544 secs

Edit: display first few records

> head(grid)
b1 b2   b3 phi id        lp
1: 0.00  1 -1.5 0.4  1 -398.7618
2: 0.05  1 -1.5 0.4  2 -380.3674
3: 0.10  1 -1.5 0.4  3 -363.5356
4: 0.15  1 -1.5 0.4  4 -348.2663
5: 0.20  1 -1.5 0.4  5 -334.5595
6: 0.25  1 -1.5 0.4  6 -322.4152

I have tried using set but that approach seems inferior

start <- Sys.time()
grid[, lp := NA_real_]
for(i in 1:nrow(grid)){
    llpp <- wraplogpost(grid[i])
    set(grid, i, "lp", llpp)
}
difftime(Sys.time(), start)
# Time difference of 21.71291 secs

Edit: display first few records

> head(grid)
b1 b2   b3 phi id        lp
1: 0.00  1 -1.5 0.4  1 -398.7618
2: 0.05  1 -1.5 0.4  2 -380.3674
3: 0.10  1 -1.5 0.4  3 -363.5356
4: 0.15  1 -1.5 0.4  4 -348.2663
5: 0.20  1 -1.5 0.4  5 -334.5595
6: 0.25  1 -1.5 0.4  6 -322.4152

Suggestions or pointers to the relevant docs would be appreciated.

Edit: per comments:

start <- Sys.time()
grid[, lp := wraplogpost(.SD), by = .I]
difftime(Sys.time(), start)
Warning messages:
1: In b2 * d$x1 :
    longer object length is not a multiple of shorter object length
2: In b3 * d$x2 :
    longer object length is not a multiple of shorter object length
3: In d$y - mu :
    longer object length is not a multiple of shorter object length
> difftime(Sys.time(), start)
Time difference of 0.01199317 secs
> 
> head(grid)
b1 b2   b3 phi id        lp
1: 0.00  1 -1.5 0.4  1 -620977.2
2: 0.05  1 -1.5 0.4  2 -620977.2
3: 0.10  1 -1.5 0.4  3 -620977.2
4: 0.15  1 -1.5 0.4  4 -620977.2
5: 0.20  1 -1.5 0.4  5 -620977.2
6: 0.25  1 -1.5 0.4  6 -620977.2

which generates the wrong values for lp.

Edit thank you for the comments and responses. I am aware that this scenario could be addressed by using alternative methods, my interest is in what the preferred way to do this is when using data.table.

Edit thank you for the responses again. As there have been none that address the question of how to do this explicitly with data.table, at the moment, I am assuming that there is no ideal way to achieve this without turning to base R.


Solution

  • I think you can use matrix multiplication and other vectorization techniques to simplify your code, which helps you avoid running function logpost in a row-wise manner.


    Below is a vectorized version of logpost, i.e., logpost2

    logpost2 <- function(d, dd, mub = 1, taub = 10, a = 0.5, z = 0.7) {
      bmat <- as.matrix(dd[, .(b1, b2, b3)])
      xmat <- cbind(1, as.matrix(d[, .(x1, x2)]))
      phi <- dd$phi
      phi_log <- log(phi)
      lp <- -(a + nrow(d) + 1) * phi_log -
        (1 / (2 * phi^2)) * colSums((d$y - tcrossprod(xmat, bmat))^2) -
        (1 / (2 * taub^2)) * rowSums((bmat - mub)^2) - (z / phi)
      lp
    }
    

    and you will see

    > start <- Sys.time()
    
    > grid[, lp := logpost2(d, .SD)]
    
    > difftime(Sys.time(), start)
    Time difference of 0.1966231 secs
    

    and

    > head(grid)
         b1 b2   b3 phi id        lp
    1: 0.00  1 -1.5 0.4  1 -398.7618
    2: 0.05  1 -1.5 0.4  2 -380.3674
    3: 0.10  1 -1.5 0.4  3 -363.5356
    4: 0.15  1 -1.5 0.4  4 -348.2663
    5: 0.20  1 -1.5 0.4  5 -334.5595
    6: 0.25  1 -1.5 0.4  6 -322.4152