
Can I subsample different sizes per group with dplyr?


Okay, so I know I could do something like this,

    mtcars %>%
      group_by(cyl) %>%
      sample_n(2)

which will give me,

Source: local data frame [6 x 11]
Groups: cyl [3]

    mpg   cyl  disp    hp  drat    wt  qsec    vs    am
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21.4     4 121.0   109  4.11 2.780 18.60     1     1
2  33.9     4  71.1    65  4.22 1.835 19.90     1     1
3  18.1     6 225.0   105  2.76 3.460 20.22     1     0
4  21.0     6 160.0   110  3.90 2.875 17.02     0     1
5  15.2     8 304.0   150  3.15 3.435 17.30     0     0
6  10.4     8 460.0   215  3.00 5.424 17.82     0     0
# ... with 2 more variables: gear <dbl>, carb <dbl>

so that's 2 samples per cylinder group. This looks cool. However, is there a way to pass a vector of sizes matching the unique values of the grouping variable, so that I get n = 1 for cars with 4 cylinders, n = 10 for cars with 6 cylinders, and so on?

Thanks!


Solution

  • Do each individually and then bind them together. I assume you're already in dplyr:

    bind_rows(
      mtcars %>% 
        group_by(cyl) %>%
        filter(cyl==4) %>%
        sample_n(1),
      mtcars %>% 
        group_by(cyl) %>%
        filter(cyl==6) %>%
        sample_n(6))
    

    We can't sample 10 rows of cyl==6 (without replacement) because mtcars only has 7 of them, so the example draws 6 ;)
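
If there are more than a couple of groups, the same "sample each group, then bind" idea can be driven by a named vector of sizes instead of writing one pipeline per group. This is only a rough sketch: `sizes`, `pieces` and `sampled` are names I made up, and the sizes themselves are illustrative. Each size has to be no larger than the number of rows in its group, since `sample_n()` samples without replacement by default.

```r
library(dplyr)

# Hypothetical sizes, keyed by the value of cyl (names are strings)
sizes <- c("4" = 1, "6" = 5, "8" = 3)

set.seed(42)  # only so the draw is reproducible

# Base R split() gives a list of data frames named by cyl value
pieces <- split(mtcars, mtcars$cyl)

# Sample each group with its own size, then stack the results
sampled <- bind_rows(
  lapply(names(sizes), function(g) sample_n(pieces[[g]], sizes[[g]]))
)

table(sampled$cyl)
```

Groups that are not named in `sizes` are simply left out of the result.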