R: sample with different sample sizes for groups


I have a data frame with two grouping columns, V1 and V2. I want to sample exactly n = 4 rows for each distinct value of V1 and make sure that at least m = 1 row is sampled for each distinct value of V2 within that group.

library(tidyverse)
set.seed(1)
df = data.frame(
  V1 = c(rep("A",6), rep("B",6)),
  V2 = c("C","C","D","D","E","E","F","F","G","G","H","H"),
  V3 = rnorm(12)
)

df
   V1 V2         V3
1   A  C -0.6264538
2   A  C  0.1836433
3   A  D -0.8356286
4   A  D  1.5952808
5   A  E  0.3295078
6   A  E -0.8204684
7   B  F  0.4874291
8   B  F  0.7383247
9   B  G  0.5757814
10  B  G -0.3053884
11  B  H  1.5117812
12  B  H  0.3898432

My desired output is, for example:

  V1    V2        V3
1 A     C     -0.626
2 A     D     -0.836
3 A     E     -0.820
4 A     E      0.329
5 B     F      0.487
6 B     G      0.576
7 B     G     -0.305
8 B     H      0.390

I do not know how to generate this output. When I group by V1 and V2 and sample one row per group, I get only 3 rows for each distinct value of V1 instead of 4.

df %>%
  group_by(V1,V2) %>%
  sample_n(1)

  V1    V2        V3
1 A     C     -0.626
2 A     D     -0.836
3 A     E     -0.820
4 B     F      0.487
5 B     G      0.576
6 B     H      0.390

The "splitstackshape" or "sampling" packages did not help.


Solution

  • Here is one approach:

    library(dplyr)
    
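    nr <- 4  # number of rows to keep per V1
    
    # Pass 1: one random row per V1/V2 combination
    first_pass <- df %>% group_by(V1, V2) %>% sample_n(1) %>% ungroup()
    
    # Pass 2: per V1, sample the rows still missing and append them to pass 1
    first_pass %>% 
      count(V1) %>% 
      mutate(n = nr - n) %>%            # rows still needed per V1
      left_join(df, by = 'V1') %>%
      group_by(V1) %>%
      sample_n(first(n)) %>%
      select(-n) %>%
      bind_rows(first_pass) %>%
      arrange(V1, V2)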
    
    #  V1    V2        V3
    #  <chr> <chr>  <dbl>
    #1 A     C      0.184
    #2 A     D     -0.836
    #3 A     E     -0.820
    #4 A     E     -0.820
    #5 B     F      0.487
    #6 B     F      0.738
    #7 B     G     -0.305
    #8 B     H      0.390
    

    The logic is to first randomly select one row for each combination of V1 and V2. We then calculate, for each V1, how many more rows are needed to reach nr rows, sample that many at random from each V1, and combine the two samples into the final dataset.
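
In the output above, the row A / E / -0.820 appears twice. This happens because the second sample is drawn from the full df, so it can pick a row that the first pass already selected. If duplicate rows are not wanted, one possible variant is sketched below. It is only a sketch under some assumptions: dplyr >= 1.0.0 (for slice_sample), every V1 has at most nr distinct V2 values and at least nr rows, and the names df_id, row_id, picked, second_pass and result are illustrative helpers that do not appear in the original answer.

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  V1 = c(rep("A", 6), rep("B", 6)),
  V2 = c("C", "C", "D", "D", "E", "E", "F", "F", "G", "G", "H", "H"),
  V3 = rnorm(12)
)

nr <- 4  # total rows wanted per V1 value

# row_id is a helper column (an assumption, not part of the original data)
# that lets us exclude rows already chosen in the first pass
df_id <- df %>% mutate(row_id = row_number())

# Pass 1: one random row for every V1/V2 combination
first_pass <- df_id %>%
  group_by(V1, V2) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Pass 2: fill each V1 up to nr rows, drawing only from rows not picked in pass 1
second_pass <- df_id %>%
  anti_join(first_pass, by = "row_id") %>%
  left_join(count(first_pass, V1, name = "picked"), by = "V1") %>%
  group_by(V1) %>%
  slice(sample(n(), first(nr - picked))) %>%  # sample row positions within the group
  ungroup() %>%
  select(-picked)

# Combine both passes and drop the helper column
result <- bind_rows(first_pass, second_pass) %>%
  arrange(V1, V2) %>%
  select(-row_id)

result
```

The anti_join step is the only conceptual change from the approach above: it removes the first-pass rows from the pool before the top-up sample is drawn, so each original row can appear at most once in result.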