I'm preparing training data for a machine learning model and want to sample a data frame that has a grouping variable, so that each group gets its own sampling rule. For instance, my data:
df = data.frame(value = 1:10, label=c("a", "a", "b", rep("c", 7)))
For groups with at most, say, 3 rows, I want to take the whole group and no more, and for bigger groups I want to take a sample of 3 rows without replacement.
So here, the result could be: df[c(1:3, 6,9,10),]
If I use group_by and sample_n, I get a size error. I thought of going "manual": splitting by group, sampling each piece with its own size, and then binding the rows again. Is there a more efficient and direct way?
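For reference, the "manual" route could look roughly like the sketch below, in base R. The names pieces and manual are just placeholders for illustration, and sample.int is used on row counts so that one-row groups are handled safely.

```r
df <- data.frame(value = 1:10, label = c("a", "a", "b", rep("c", 7)))

# Split by label, sample at most 3 rows from each piece, then bind the pieces back.
pieces <- lapply(split(df, df$label), function(g) {
  k <- min(nrow(g), 3)                       # whole group if it has 3 rows or fewer
  g[sample.int(nrow(g), k), , drop = FALSE]  # sampling without replacement
})
manual <- do.call(rbind, pieces)
```

This works, but it leaves the grouping logic spread over three steps, which is what I would like to avoid.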
Use the size of the group, n(), inside sample_n, so the sample size is capped at the group size:
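library(dplyr)
# min(n(), 3) is evaluated per group: the whole group if it has 3 rows or fewer, otherwise 3 rows
df %>% group_by(label) %>% sample_n(min(n(), 3))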
# A tibble: 6 x 2
# Groups: label [3]
# value label
# <int> <fct>
#1 1 a
#2 2 a
#3 3 b
#4 5 c
#5 10 c
#6 4 c
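If you are on dplyr 1.0.0 or later, where sample_n is superseded, slice_sample may be a simpler option: when sampling without replacement, it silently truncates n to the group size, so no min() is needed. A minimal sketch, assuming the same df as in the question (the name sampled is only for illustration):

```r
library(dplyr)

df <- data.frame(value = 1:10, label = c("a", "a", "b", rep("c", 7)))

sampled <- df %>%
  group_by(label) %>%
  slice_sample(n = 3) %>%  # groups with fewer than 3 rows are returned whole
  ungroup()
```

The rows picked from group c change from run to run; call set.seed() first if you need a reproducible sample.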