
How to efficiently sample from a data.table by column in R?


How can I efficiently sample one row for each unique value in a column of a data.table in R? For example, given the data.table:

library(data.table)
set.seed(1)

dt <- data.table(
                   A = sample(c("A", "B", "C", "D", "E"), 100, replace = TRUE),
                   B = sample(1:100, 100, replace = TRUE),
                   C = sample(101:200, 100, replace = TRUE)
                 )

I need to sample one row for each unique value in column A. For example:

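out <- list()
groups <- unique(dt$A)
for (i in seq_along(groups)) {
  # Row numbers belonging to the current group
  rows <- dt[, .I[A == groups[i]]]
  # Index via sample.int so a one-row group is not treated as 1:rows
  out[[i]] <- dt[rows[sample.int(length(rows), 1)]]
}
out <- do.call("rbind", out)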

However, the data.table I am applying this to is very large. Is there a data.table method I can use to improve performance?


Solution

  • You can call sample on .N within each group to pick one random row index, then use it to subset .SD.

    library(data.table)
    set.seed(123)
    dt[, .SD[sample(.N, 1)], by = A]
    
    #   A   B   C
    #1: A  31 143
    #2: D  16 175
    #3: B 100 165
    #4: E  27 190
    #5: C  90 197
    

    dplyr offers the slice_sample function (which supersedes sample_n) for the same task:

    library(dplyr)
    dt %>% group_by(A) %>% slice_sample(n = 1)
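
Since the question is about very large tables, a variation worth trying is to sample row numbers with .I instead of subsetting .SD in every group, and then index the table once. This is a minimal sketch that reuses the example table from the question; the names idx and out are only illustrative. The .I idiom is commonly reported to be faster than .SD[...] by group, but that is an assumption to verify by timing it on your own data.

```r
library(data.table)
set.seed(1)

# Same example table as in the question
dt <- data.table(
  A = sample(c("A", "B", "C", "D", "E"), 100, replace = TRUE),
  B = sample(1:100, 100, replace = TRUE),
  C = sample(101:200, 100, replace = TRUE)
)

# Step 1: within each group of A, .I holds the group's row numbers in dt.
# Pick one of them at random; the unnamed result column defaults to V1.
idx <- dt[, .I[sample(.N, 1)], by = A]$V1

# Step 2: subset the original table once with the sampled row numbers.
out <- dt[idx]
out
```

Because sample(.N, 1) draws a position between 1 and the group size, this also behaves correctly for groups that contain a single row.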