I have a data frame with 2 grouping columns V1 and V2. I want to sample exactly n = 4 elements for each distinct value in V1 and make sure that a minimum of m = 1 of each distinct element in V2 is sampled.
library(tidyverse)
set.seed(1)
df = data.frame(
  V1 = c(rep("A",6), rep("B",6)),
  V2 = c("C","C","D","D","E","E","F","F","G","G","H","H"),
  V3 = rnorm(12)
)
df
V1 V2 V3
1 A C -0.6264538
2 A C 0.1836433
3 A D -0.8356286
4 A D 1.5952808
5 A E 0.3295078
6 A E -0.8204684
7 B F 0.4874291
8 B F 0.7383247
9 B G 0.5757814
10 B G -0.3053884
11 B H 1.5117812
12 B H 0.3898432
My desired output is for example ...
V1 V2 V3
1 A C -0.626
2 A D -0.836
3 A E -0.820
4 A E 0.329
5 B F 0.487
6 B G 0.576
7 B G -0.305
8 B H 0.390
I do not know how to generate this output. When I group by V1 and V2 and sample one row per group, I get only 3 rows for each distinct value in V1 (one per V2 value) instead of 4.
df %>%
  group_by(V1,V2) %>%
  sample_n(1)
V1 V2 V3
1 A C -0.626
2 A D -0.836
3 A E -0.820
4 B F 0.487
5 B G 0.576
6 B H 0.390
Neither the "splitstackshape" nor the "sampling" package helped.
Here is one approach:
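library(dplyr)
nr <- 4  # number of rows wanted per V1 value
# First pass: one random row for each (V1, V2) combination
first_pass <- df %>% group_by(V1, V2) %>% sample_n(1) %>% ungroup()
# Second pass: top up each V1 group to nr rows, then combine with the first pass
first_pass %>%
  count(V1) %>%                 # rows already selected per V1
  mutate(n = nr - n) %>%        # rows still needed per V1
  left_join(df, by = 'V1') %>%
  group_by(V1) %>%
  sample_n(first(n)) %>%        # draw the missing rows from all rows of that V1
  select(-n) %>%
  bind_rows(first_pass) %>%
  arrange(V1, V2)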
# V1 V2 V3
# <chr> <chr> <dbl>
#1 A C 0.184
#2 A D -0.836
#3 A E -0.820
#4 A E -0.820
#5 B F 0.487
#6 B F 0.738
#7 B G -0.305
#8 B H 0.390
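The logic is to first randomly select 1 row for each combination of V1 and V2. We then calculate, for each V1, how many more rows we need to reach nr rows, sample that many rows randomly from each V1, and combine the two samples into the final dataset. Note that the second pass samples from all rows of df, so it can draw a row that the first pass already selected, as with the repeated A E -0.820 row in the output above.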
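If repeated rows are not wanted, the same two-pass idea can be adapted so that the second pass only draws from rows the first pass did not pick. The following is a minimal sketch, not the approach from the answer above: the helper column row_id and the names df_id, second_pass, n_first and result are introduced here for illustration, and it assumes dplyr 1.0.0 or later (for slice_sample).

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  V1 = c(rep("A", 6), rep("B", 6)),
  V2 = c("C", "C", "D", "D", "E", "E", "F", "F", "G", "G", "H", "H"),
  V3 = rnorm(12)
)

nr <- 4  # rows wanted per V1 value

# Tag every row so already-selected rows can be excluded later
# (row_id is a helper column assumed for this sketch).
df_id <- df %>% mutate(row_id = row_number())

# First pass: one random row per (V1, V2) combination.
first_pass <- df_id %>%
  group_by(V1, V2) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Second pass: per V1, draw the missing rows from the rows not yet chosen.
second_pass <- df_id %>%
  anti_join(first_pass, by = "row_id") %>%
  left_join(count(first_pass, V1, name = "n_first"), by = "V1") %>%
  group_by(V1) %>%
  # size is capped so it is never negative or larger than the group
  slice(sample.int(n(), size = min(n(), max(0, nr - n_first[1])))) %>%
  ungroup() %>%
  select(-n_first)

result <- bind_rows(first_pass, second_pass) %>%
  select(-row_id) %>%
  arrange(V1, V2)

result
```

With the example data this should give 4 distinct rows per V1 value, with every V2 value represented at least once. If a V1 group has more distinct V2 values than nr, the first pass alone already exceeds nr rows, so that case would need separate handling.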