Sample from a data frame using group-specific sample sizes


I want to sample rows from a data frame using unequal sample sizes from each group.

Let's say we have a simple data frame grouped by 'group':

library(dplyr)
set.seed(123)

df <- data.frame(group = rep(c("A", "B"), each = 10), 
                 value = rnorm(10))
df
#>    group       value
#> 1      A -0.56047565
#> 2      A -0.23017749
#> .....
#> 10     A -0.44566197
#> 11     B -0.56047565
#> 12     B -0.23017749
#> .....
#> 20     B -0.44566197

Using slice_sample() from the dplyr package, you can easily sample the same number of rows from each group of this data frame:

df %>% group_by(group) %>% slice_sample(n = 2) %>% ungroup()

#> # A tibble: 4 x 2
#>   group  value
#>   <fct>  <dbl>
#> 1 A     -0.687
#> 2 A     -0.446
#> 3 B     -0.687
#> 4 B      1.56

Question

How do you sample a different number of rows from each group, so that the sampled groups are not equal in size? For example, 4 rows from group A and 5 rows from group B?


Solution

  • The easiest thing I can think of is a map2 solution using purrr.

    library(dplyr)
    library(purrr)
    
    df %>% 
      group_split(group) %>% 
      map2_dfr(c(4, 5), ~ slice_sample(.x, n = .y))
    
    # A tibble: 9 x 2
      group   value
      <chr>   <dbl>
    1 A     -0.687 
    2 A      1.56  
    3 A      0.0705
    4 A      1.72  
    5 B     -0.560 
    6 B      0.461 
    7 B      0.129 
    8 B      0.0705
    9 B     -0.230 
    

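    One caveat: this relies on the order of the split matching the order of the sample sizes. I think group_split() returns the groups sorted by the grouping variable (by level order if it is a factor), so c(4, 5) lines up with A and B here, but it is easy to get wrong. A way around that is to split into a named list instead and look up n by group name from a named vector: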

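    # sample size per group, keyed by group name
    group_slice_n <- c(A = 4, B = 5)
    
    df %>% 
      split(.$group) %>%   # named list: one data frame per group
      # .y is the list element's name, i.e. the group label
      imap_dfr(~ slice_sample(.x, n = group_slice_n[[.y]]))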
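
For comparison, here is a minimal sketch of the same name-based lookup done with dplyr alone, using group_modify() instead of purrr. The data frame is recreated from the question. The names `sizes` and `sampled` are only illustrative, and the sketch assumes every group in the data has an entry in `sizes`.

```r
library(dplyr)

set.seed(123)

# Same data frame as in the question (rnorm(10) is recycled across both groups)
df <- data.frame(group = rep(c("A", "B"), each = 10),
                 value = rnorm(10))

# Hypothetical lookup vector: number of rows to sample, keyed by group name
sizes <- c(A = 4, B = 5)

sampled <- df %>%
  group_by(group) %>%
  # .x is the data for one group, .y is a one-row tibble holding the group key;
  # as.character() guards against `group` being a factor, where [[ ]] would
  # otherwise index by integer code instead of by name
  group_modify(~ slice_sample(.x, n = sizes[[as.character(.y$group)]])) %>%
  ungroup()

# Rows returned per group
count(sampled, group)
```

Because the size is looked up by group name, the result does not depend on the order in which the groups are processed. A group missing from `sizes` makes `[[` throw an error rather than silently sampling the wrong number of rows.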