r data.table tidyverse microbenchmark

Performance of data.table


I always assumed that data.table provided the best performance on data access.

However, I got the results below when I benchmarked the following two statements.

app_sig_reg[which(app_sig_reg$input == proj$country),]$value
app_sig_reg[input == proj$country,value]

where app_sig_reg is a data.table object.

These are the results I get when I use the microbenchmark package to measure their performance.

microbenchmark(
  app_sig_reg[which(app_sig_reg$input == proj$country),]$value,
  app_sig_reg[input == proj$country,value]
)

Unit: microseconds
                                                          expr   min     lq     mean  median      uq    max neval
 app_sig_reg[which(app_sig_reg$input == proj$country), ]$value 118.5 132.05  165.932  146.55  163.70  489.1   100
                     app_sig_reg[input == proj$country, value] 967.3 993.85 1098.607 1028.05 1123.35 1752.6   100

My assumption was that app_sig_reg[input == proj$country,value] would execute faster, but the results indicate the opposite.

I would appreciate any insight on this.


Solution

  • The question does not make completely clear what is being matched. If it is only one country, the results below show that speed depends on

    1. whether the row filter is wrapped in which() or is a bare equality test;
    2. the extractor used for the column: $ versus the [ method for objects of class "data.table".

    If the test is against many countries with %in% rather than an equality test against a single one, the results may differ.

    library(data.table)
    library(microbenchmark)
    library(ggplot2)
    
    set.seed(2022)
    app_sig_reg <- data.table(
      input = sample(letters, 100, TRUE),
      value = runif(100)
    )
    proj <- data.table(country = sample(letters, 1))
    
    
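    # Benchmark the four subsetting forms on tables of growing size.
    # Note: proj is looked up in the calling (global) environment.
    testFun <- function(X, n){
      out <- lapply(seq.int(n), \(k){
        Y <- X
        # double the table k times, giving nrow(X) * 2^k rows
        for(i in seq.int(k)) Y <- rbind(Y, Y)
        mb <- microbenchmark(
          `which$` = Y[which(Y$input == proj$country), ]$value,
          `which[` = Y[which(input == proj$country), value],
          `equal$` = Y[input == proj$country,]$value,
          `equal[` = Y[input == proj$country,value]
        )
        # median time per expression (microbenchmark reports nanoseconds)
        agg <- aggregate(time ~ expr, mb, median)
        agg$nrow <- nrow(Y)
        agg
      })
      # stack the per-size summaries into one data.frame
      do.call(rbind, out)
    }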
    
    res <- testFun(app_sig_reg, 15)
    
    ggplot(res, aes(nrow, time, color = expr)) +
      geom_line() +
      geom_point() +
      scale_color_manual(values = c(`which$` = "red", `equal$` = "orangered", `which[` = "blue", `equal[` = "skyblue")) +
      scale_x_continuous(trans = "log10") +
      scale_y_continuous(trans = "log10") +
      theme_bw()
    

    Created on 2022-02-20 by the reprex package (v2.0.1)
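
The remark about %in% can be tried with a variation of the same benchmark. This is only a minimal sketch under stated assumptions: the table dt, the vector countries and the result mb_in are hypothetical names that do not appear in the question, and the table size (1e5 rows) and the three letters are arbitrary choices. It first checks that the four forms return the same values, then times them. The timings depend on the machine and on the data.table version, so none are quoted here.

```r
library(data.table)
library(microbenchmark)

set.seed(2022)
# Hypothetical data with the same columns as app_sig_reg, but more rows
dt <- data.table(
  input = sample(letters, 1e5, TRUE),
  value = runif(1e5)
)
# Assumption: several countries are matched instead of a single one
countries <- c("a", "m", "z")

# The four forms should return the same values, so verify that first
r1 <- dt[which(dt$input %in% countries), ]$value
r2 <- dt[which(input %in% countries), value]
r3 <- dt[input %in% countries, ]$value
r4 <- dt[input %in% countries, value]
stopifnot(identical(r1, r2), identical(r1, r3), identical(r1, r4))

# Time the same four forms, now with %in% in place of ==
mb_in <- microbenchmark(
  `which$` = dt[which(dt$input %in% countries), ]$value,
  `which[` = dt[which(input %in% countries), value],
  `equal$` = dt[input %in% countries, ]$value,
  `equal[` = dt[input %in% countries, value],
  times = 50
)
print(mb_in)
```

Wrapping this in a loop over table sizes, as testFun does above, would show whether the ranking changes as the table grows.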