r, data.table, sampling

How to sample percent by group using data.table?


This post discusses a routine for sampling with different percentages by group.

But what if you just want to sample, say, 50% of each group without replacement? Or 50% of each group with replacement?

With dplyr, sample_frac does this. How can it be done with data.table?


Solution

  • If the rows of the data.table, and the group each row belongs to, stay fixed throughout the simulation, pre-calculating each group's row indices once more than doubles the speed over thousands of replications.

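    library(data.table)
    
    # 1,000 rows in 10 groups (A); B is a unique row identifier
    dt <- data.table(A = sample(1:10, 1e3, replace = TRUE), B = sample(1000))
    
    # Baseline: recompute the per-group row indices on every replication
    system.time(for (i in 1:1e4) dt[dt[, .I[sample(.N, .N %/% 2)], by = A][[2]]])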
    #>    user  system elapsed 
    #>    4.83    0.23    5.06
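    system.time({
      # Pre-compute each group's row indices once, as a list column
      idx <- dt[, .(.(.I)), by = A][[2]]
      # Each replication then only samples half of every group's indices
      for (i in 1:1e4) dt[unlist(lapply(idx, function(x) sample(x, length(x) %/% 2)))]
    })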
    #>    user  system elapsed 
    #>    1.78    0.13    1.90
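
The timings above only cover sampling without replacement. For the with-replacement case raised in the question, one possible sketch is to wrap the same `.I[sample(.N, ...)]` idiom in a small helper. The function `sample_frac_by` and its arguments `group_col`, `frac`, and `replace` are hypothetical names chosen for illustration; they are not part of data.table.

```r
library(data.table)

set.seed(42)  # only to make this sketch reproducible
dt <- data.table(A = sample(1:10, 1e3, replace = TRUE), B = sample(1000))

# Hypothetical helper (not a data.table function): sample a fraction of
# the rows of each group, with or without replacement.
sample_frac_by <- function(dt, group_col, frac = 0.5, replace = FALSE) {
  # .I holds the row numbers of dt that fall in the current group.
  # Sampling positions 1..N and then indexing .I sidesteps sample()'s
  # special handling of a single number as its first argument.
  idx <- dt[, .I[sample(.N, floor(.N * frac), replace = replace)],
            by = group_col]$V1
  dt[idx]
}

half_without <- sample_frac_by(dt, "A")                  # 50% per group, no repeats
half_with    <- sample_frac_by(dt, "A", replace = TRUE)  # 50% per group, repeats allowed
```

Both calls return `floor(.N * frac)` rows per group, so with `frac = 0.5` the per-group sizes match the `.N %/% 2` used in the timings. With `replace = TRUE` the same row can appear more than once.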