Tags: r, group-by, dplyr, sampling

Differentiated sampling rate by group


To train a machine learning model, I'm trying to sample a data frame that has a grouping variable, so that each group gets its own sampling rule. For instance, my data:

df = data.frame(value = 1:10, label=c("a", "a", "b", rep("c", 7)))

For groups with fewer than, say, 3 rows, I want to take the whole group and no more; for bigger groups, I want a sample of size 3 without replacement.

So here, the result could be: df[c(1:3, 6,9,10),]

If I use group_by and sample_n, I get a size error. I thought of going "manual": splitting the data, sampling each piece differently, and then binding the rows back together. Is there a more efficient and direct way?
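For reference, the "manual" route described above could look roughly like the base R sketch below. The names pieces, sampled and result are illustrative only, and the cap of 3 is the example value from the question.

```r
df <- data.frame(value = 1:10, label = c("a", "a", "b", rep("c", 7)))

# Split into one data frame per label, sample each piece, then bind the rows back.
pieces <- split(df, df$label)
sampled <- lapply(pieces, function(g) {
  k <- min(nrow(g), 3)                        # whole group when it has 3 rows or fewer
  g[sample.int(nrow(g), k), , drop = FALSE]   # draw k row positions without replacement
})
result <- do.call(rbind, sampled)
```

Sampling row positions with sample.int avoids the surprise that sample(x) treats a single number x as the range 1:x, which matters for one-row groups.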


Solution

  • Use the group size, n(), inside sample_n, so that the sample size is capped at the number of rows in each group.

    library(dplyr)

    df %>% group_by(label) %>% sample_n(min(n(), 3))
    
    # A tibble: 6 x 3
    # Groups:   label [3]
    #  value label     n
    #  <int> <fct> <int>
    #1     1 a         2
    #2     2 a         2
    #3     3 b         1
    #4     5 c         7
    #5    10 c         7
    #6     4 c         7
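As a possible alternative, assuming dplyr 1.0.0 or later is installed: sample_n has been superseded there by slice_sample, which, when sampling without replacement, silently truncates n to the group size. A minimal sketch (the name res is illustrative only):

```r
library(dplyr)

df <- data.frame(value = 1:10, label = c("a", "a", "b", rep("c", 7)))

res <- df %>%
  group_by(label) %>%
  slice_sample(n = 3) %>%   # groups with fewer than 3 rows are returned whole
  ungroup()
```

The rows picked from group "c" change from run to run; call set.seed() beforehand if the sample needs to be reproducible.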