Search code examples
rperformancedata.tablesubsetbenchmarking

Improving data.table subsetting performance


I am running a large monte-carlo simulation, and I discovered that sub-setting/searching my data is the slowest part of my code. In order to test some alternatives, I bench-marked performance with dataframes, data.table, and a matrix. Here is the benchmark code:

library(data.table)
#install.packages('profvis')
library(profvis)
x.df = data.frame(a=sample(1:10,10000,replace=T), b=sample(1:10,10000,replace=T)) # set up a dataframe
x.dt = as.data.table(x.df) # a data.table
setkey(x.dt,a) # set key for faster searches
x.mat = as.matrix(x.df) # a matrix

profvis({
for (i in 1:10000) {
  # test simple subsetting
  xsubset.mat = x.mat[100:200,2]
  xsubset.df = x.df[100:200,2]
  xsubset.dt = x.dt[100:200,2]
  # test search preformance
  xsearch.mat = x.mat[which(x.df$a==10),2]
  xsearch.df = x.df[which(x.df$a==10),2]
  xsearch.dt = x.dt[.(10),2]
}
})

Here are my results: benchmark scores In all seriousness, I love the compact syntax of data.table, and I am wondering if there is something I can do to improve its performance. According to the creators, its supposed to be super fast. Am I using it incorrectly?


Solution

  • After some more benchmarking, I now understand the issue. The fastest package depends on whether I'm doing many small searches or one big search. It seems that data.table has a lot of overhead per search, making it more suited for working with one huge table, not many searches on small ones.

    Consider the following code, and compare with the original:

    # make a giant table, but search it only once:
    x.df = data.frame(a=sample(1:10,100000000,replace=T), b=sample(1:10,100000000,replace=T))
    x.dt = as.data.table(x.df)
    setkey(x.dt,a)
    x.mat = as.matrix(x.df)
    
    profvis({
    for (i in 1:1) {
      xsubset.mat = x.mat[100:200,2]
      xsubset.df = x.df[100:200,2]
      xsubset.dt = x.dt[100:200,2]
    
      xsearch.mat = x.mat[which(x.df$a==10),2]
      xsearch.df = x.df[which(x.df$a==10),2]
      xsearch.dt = x.dt[.(10),2]
    }
    })
    

    Results: searching one big table