Performance of data.table

I always assumed that data.table provided the best performance for data access.

However, I came across unexpected results when benchmarking the following two statements.

app_sig_reg[which(app_sig_reg$input == proj$country),]$value
app_sig_reg[input == proj$country,value]

where app_sig_reg is a data.table object.

These are the results I get when I use the microbenchmark package to measure their performance.

microbenchmark(
  app_sig_reg[which(app_sig_reg$input == proj$country),]$value,
  app_sig_reg[input == proj$country,value]
)

Unit: microseconds
                                                          expr   min     lq     mean  median      uq    max neval
 app_sig_reg[which(app_sig_reg$input == proj$country), ]$value 118.5 132.05  165.932  146.55  163.70  489.1   100
                     app_sig_reg[input == proj$country, value] 967.3 993.85 1098.607 1028.05 1123.35 1752.6   100

My assumption was that app_sig_reg[input == proj$country,value] would execute faster, but the results indicate the opposite.

I would appreciate any insight on this.

Solution

The question does not make clear what is being matched. If it is a single country, the results below show that speed depends on two things:

which() versus a bare equality test for selecting the rows;
the extractor: $ versus the [ method for objects of class "data.table".
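
Before timing anything, it may help to confirm that the four combinations of row selection and extractor return the same vector, so the benchmark compares like with like. The following is only a minimal sketch: the table dt and the variable target are hypothetical stand-ins for app_sig_reg and proj$country.

```r
library(data.table)

# Hypothetical toy data standing in for app_sig_reg
dt <- data.table(
  input = c("a", "b", "a", "c"),
  value = c(1, 2, 3, 4)
)
target <- "a"  # a single country, standing in for proj$country

# which() for the rows, then $ to extract the column
which_dollar  <- dt[which(dt$input == target), ]$value
# which() for the rows, column extracted inside [
which_bracket <- dt[which(input == target), value]
# bare equality test for the rows, then $ to extract the column
equal_dollar  <- dt[input == target, ]$value
# bare equality test for the rows, column extracted inside [
equal_bracket <- dt[input == target, value]

# All four forms should agree
all_same <- identical(which_dollar, which_bracket) &&
  identical(which_dollar, equal_dollar) &&
  identical(which_dollar, equal_bracket)
print(all_same)
```

If the forms agree, any timing difference comes from how the rows are located and how the column is extracted, not from the result itself.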

If the tests are for many countries with %in%, rather than an equality test for a single element (country), the results may vary.
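
For the multi-country case, the same two styles could be written with %in% as sketched below. This is an illustration under assumptions, not something taken from the question: dt and countries are hypothetical names, and no timings are implied.

```r
library(data.table)

# Hypothetical toy data standing in for app_sig_reg
dt <- data.table(
  input = c("a", "b", "a", "c", "d"),
  value = c(1, 2, 3, 4, 5)
)
countries <- c("a", "c")  # hypothetical set of several countries

# which() with %in%, then $ to extract the column
in_which_dollar <- dt[which(dt$input %in% countries), ]$value
# bare %in% test for the rows, column extracted inside [
in_bracket <- dt[input %in% countries, value]

print(identical(in_which_dollar, in_bracket))
```

These expressions could be dropped into the microbenchmark() call in place of the equality versions to see how the ranking changes for a given table size.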

library(data.table)
library(microbenchmark)
library(ggplot2)

set.seed(2022)
app_sig_reg <- data.table(
  input = sample(letters, 100, TRUE),
  value = runif(100)
)
proj <- data.table(country = sample(letters, 1))


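# Benchmark the four access patterns on tables of increasing size.
# X: starting data.table; n: maximum number of doublings.
# Note: proj is looked up in the global environment.
testFun <- function(X, n){
  out <- lapply(seq.int(n), \(k){
    Y <- X
    # double the table k times, giving nrow(X) * 2^k rows
    for(i in seq.int(k)) Y <- rbind(Y, Y)
    mb <- microbenchmark(
      `which$` = Y[which(Y$input == proj$country), ]$value,
      `which[` = Y[which(input == proj$country), value],
      `equal$` = Y[input == proj$country,]$value,
      `equal[` = Y[input == proj$country,value]
    )
    # median time per expression (microbenchmark records nanoseconds)
    agg <- aggregate(time ~ expr, mb, median)
    agg$nrow <- nrow(Y)
    agg
  })
  # stack the per-size summaries into one data.frame
  do.call(rbind, out)
}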

res <- testFun(app_sig_reg, 15)

ggplot(res, aes(nrow, time, color = expr)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = c(`which$` = "red", `equal$` = "orangered", `which[` = "blue", `equal[` = "skyblue")) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  theme_bw()

Created on 2022-02-20 by the reprex package (v2.0.1)