Suppose I have an example dataset looking like this:
df = data.table(id = 1:100,group=rep(c('a','b','c','d'),25))
I would like to draw, say, 80 observations from this set, split into x non-overlapping samples. An important requirement is that each sample be spread uniformly across the groups.
For example:
x = 20 would give a first sample such as:
1 a
5 b
15 c
28 d
This is a very convenient example, but the approach must also work for less convenient cases (x = 7, for example).
My first try was using split, like this:
df_split = split(df, as.numeric(as.factor(df$id)) %% 7)
That does what I want, except that it does not pick uniformly from each group!
If I understand correctly, you are looking for 7 sets of 80 observations each, so you may want to run this as a loop:
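library(data.table)
library(dplyr)  # provides sample_n()

dt <- data.table(id = 1:100, group = rep(c('a', 'b', 'c', 'd'), 25))
newmat <- data.frame(Index = 1:80)
for (i in 1:7) {
  k <- NULL
  # Draw 20 ids from each group without replacement, giving 80 ids per set
  for (j in unique(dt$group)) {
    dt.sub <- dt[group == j]
    samps <- sample_n(dt.sub, 20, replace = FALSE)
    k <- c(k, samps$id)
  }
  # Each set is drawn independently, so ids can repeat across the 7 columns
  newmat <- cbind(newmat, k)
}
colnames(newmat) <- c("Index", paste0("k", 1:7))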
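Note that the loop above draws every set independently, so the same id can show up in more than one set. If the samples have to be non-overlapping, as the question asks, one possible approach is to draw the 80 rows once (20 per group) and then deal them out to the x samples in a cycle. The following is only a sketch under assumptions: the names n_total, n_samples, per_group, picked and sample_id are made up for illustration, and it assumes that n_total divides evenly by the number of groups.

```r
library(data.table)

set.seed(1)  # only so that this sketch is reproducible
df <- data.table(id = 1:100, group = rep(c("a", "b", "c", "d"), 25))

n_total   <- 80  # observations to draw overall (hypothetical name)
n_samples <- 7   # the "x" from the question (hypothetical name)
per_group <- n_total / uniqueN(df$group)  # 20 here; assumes an even split

# Draw per_group rows from every group, without replacement and in random order.
# The result is ordered by group, so rows 1-20 are 'a', 21-40 are 'b', and so on.
picked <- df[, .SD[sample(.N, per_group)], by = group]

# Deal the rows out to samples 1..n_samples in a cycle. The cycle continues
# across group boundaries, so sample sizes differ by at most one row.
picked[, sample_id := rep_len(seq_len(n_samples), .N)]

# One data.table per sample; every id appears in exactly one of them.
samples <- split(picked, by = "sample_id")
```

With n_samples = 20 every sample receives exactly one row from each group, as in the example in the question. With n_samples = 7 each sample receives two or three rows from each group, which is as close to uniform as 20 rows per group allow.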