How can I efficiently sample one row for each unique value in a column of a data.table in R? For example, given the data.table:
library(data.table)
set.seed(1)
dt <- data.table(
A = sample(c("A", "B", "C", "D", "E"), 100, replace = T),
B = sample(1:100, 100, replace = T),
C = sample(101:200, 100, replace = T)
)
I need to sample one row for each unique value in column A. My current approach is:
out <- list()
for (i in 1:length(unique(dt$A))){
out[[i]] <- dt[sample(dt[, .I[A == unique(dt$A)[i]]], 1, replace = T)]
}
out <- do.call("rbind", out)
However, the data.table I am applying this to is very large. Is there a data.table method I can use to improve performance?
You can use sample() on .N within each group to select one random row per group.
library(data.table)
set.seed(123)
dt[, .SD[sample(.N, 1)], A]
# A B C
#1: A 31 143
#2: D 16 175
#3: B 100 165
#4: E 27 190
#5: C 90 197
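Since the question mentions a very large table, a variant that may be faster is to sample row indices with .I and subset once, rather than building .SD for every group. The following is a minimal sketch under that assumption, reusing the example data from the question; the names idx and out are only illustrative, and any speedup should be benchmarked on your own data.

```r
library(data.table)

set.seed(1)
dt <- data.table(
  A = sample(c("A", "B", "C", "D", "E"), 100, replace = TRUE),
  B = sample(1:100, 100, replace = TRUE),
  C = sample(101:200, 100, replace = TRUE)
)

# For each group of A, pick one position in 1..(.N) and map it to the
# corresponding row number of dt via .I; the result lands in column V1.
idx <- dt[, .I[sample(.N, 1)], by = A]$V1

# Subset the original table once using the sampled row numbers.
out <- dt[idx]
```

Because sample(.N, 1) draws from 1 to .N, this also behaves correctly for groups that contain a single row.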
dplyr has the slice_sample() function (previously sample_n()) for this:
library(dplyr)
dt %>% group_by(A) %>% slice_sample(n = 1)