
Pick a uniformly distributed sample in a data.table


Suppose I have an example dataset looking like this:

df = data.table(id = 1:100,group=rep(c('a','b','c','d'),25))

I would like to take, let's say, 80 observations from this set, split into x non-overlapping samples. An important requirement is that each sample must be uniformly distributed across the groups.

For example:

x=20 will give a first sample of
1 a
6 b
15 c
28 d

This is a very convenient example, but the approach must also work for less convenient cases (x=7, for example).

My first try was using split, like this:

df_split = split(df, as.numeric(as.factor(df$id)) %% 7)

That does what I want, except that it does not pick uniformly from each group!
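One way to see how the groups end up distributed across the chunks is to tabulate them. The following is only a minimal sketch: `group_counts` is a name introduced here for illustration, and the split itself is the one from the attempt above. Note that this split uses all 100 rows, not 80.

```r
library(data.table)

df <- data.table(id = 1:100, group = rep(c("a", "b", "c", "d"), 25))
df_split <- split(df, as.numeric(as.factor(df$id)) %% 7)

# Count rows per group inside every chunk.
# Rows of the result are chunks, columns are groups.
groups <- c("a", "b", "c", "d")
group_counts <- t(sapply(df_split, function(d) table(factor(d$group, levels = groups))))
group_counts
```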


Solution

  • If I understand this correctly, since you are looking for 7 sets of 80 samples, you may want to run this as a loop:

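    library(data.table)
    library(dplyr)  # provides sample_n()
    
    dt <- data.table(id = 1:100, group = rep(c('a','b','c','d'), 25))
    
    # One column per set; each set holds 20 ids from each of the 4 groups
    newmat <- data.frame(Index = 1:80)
    for (i in 1:7) {
      k <- NULL
      for (j in unique(dt$group)) {
        dt.sub <- dt[group == j]
        # Draw 20 rows from group j without replacement
        samps <- sample_n(dt.sub, 20, replace = FALSE)
        k <- c(k, samps$id)
      }
      newmat <- cbind(newmat, k)
    }
    
    colnames(newmat) <- c("Index", paste0("k", 1:7))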
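
The loop above draws each of the 7 sets independently, so the same id can appear in more than one set. If the question is instead read as "80 observations in total, dealt into x non-overlapping samples", one possible sketch is shown below. It is not the approach of the answer above: it uses data.table only, and the names `n_total`, `n_samples`, `per_group`, `picked` and `sample_id` are introduced here for illustration. It also assumes that `n_total` is divisible by the number of groups.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

dt <- data.table(id = 1:100, group = rep(c("a", "b", "c", "d"), 25))

n_total   <- 80  # observations to keep overall
n_samples <- 7   # x in the question

# Keep the same number of rows from every group (80 / 4 = 20 here).
# Assumes n_total is divisible by the number of groups.
per_group <- n_total / uniqueN(dt$group)
picked <- dt[, .SD[sample(.N, per_group)], by = group]

# Within each group, assign shuffled sample labels 1..n_samples, so every
# sample receives floor(20 / 7) or ceiling(20 / 7) rows of that group.
picked[, sample_id := sample(rep_len(seq_len(n_samples), .N)), by = group]

# One data.table per sample; no id appears in more than one of them.
samples <- split(picked, by = "sample_id")
```

With 20 rows per group and x = 7, samples 1 to 6 receive 3 rows of every group and sample 7 receives 2, so each sample is balanced across groups but the samples are not all the same size.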