I have a data set, in which observations are divided into groups and subgroups, and into types. Here's an example of such data:
data <- data.frame(group = sample(c(1:30), 10000, replace = TRUE),
subgroup = sample(c(1:100), 10000, replace = TRUE),
type = sample(c("A", "B"), 10000, replace = TRUE),
value = sample(c(1:100), 10000, replace = TRUE))
I need to sample equal Ns of each type, within each combination of group and subgroup.
In this example, as well as in the actual data, many combinations of group+subgroup+type are empty. Such combinations for which N = 0 are to be dropped.
My solution is as follows:
First, I find the N to be sampled for each group & subgroup. I count how many As and Bs are within each combination and choose the smaller number:
library(tidyverse)
n_to_sample <- data %>%
group_by(group, subgroup, type) %>%
summarise(n = n()) %>%
pivot_wider(id_cols = c("group", "subgroup"),
names_from = "type",
values_from = "n") %>%
rowwise() %>%
mutate(lowestN = min(A, B)) %>%
ungroup() %>%
filter(!is.na(lowestN)) %>%
select (group, subgroup, lowestN)
Second, I sample observations, according to the Ns counted earlier:
data_sampled <- data %>%
left_join(n_to_sample, by = c("group", "subgroup")) %>%
filter(!is.na(lowestN)) %>%
arrange(group, type) %>%
filter(row_number() %in% c(sample(which(type == "A"), mean(lowestN)),
sample(which(type == "B"), mean(lowestN))),
.by = c("group", "subgroup"))
To verify the procedure, I check that how many observations are left whithin each combination of group+subgroup+type:
data %>%
count(group, type, subgroup) %>%
arrange(group, subgroup, type) %>%
pivot_wider(id_cols = c("group", "subgroup"),
names_from = "type",
values_from = "n")
However, this check results in many NAs in the count variables. So my solution was not successful. Where did I go wrong? What can I do to correct it?
(NOTE: I'm not so interested in efficiency, which could probably be better. I just want it to work properly)
Here is one possible approach which first computes the lowestN
per group, subgroup and type, then splits the data by these variables, draws the sample and binds by row.
Note: I stripped down your example data to make it more minimal.
set.seed(1)
n <- 20
data <- data.frame(
group = sample(c(1:2), n, replace = TRUE),
subgroup = sample(c(1:2), n, replace = TRUE),
type = sample(c("A", "B"), n, replace = TRUE),
value = sample(c(1:100), n, replace = TRUE)
)
library(dplyr, warn = FALSE)
data |>
add_count(group, subgroup, type, name = "lowestN") |>
mutate(
lowestN = if (length(unique(type)) == 2) min(lowestN) else NA,
.by = c(group, subgroup)
) |>
filter(!is.na(lowestN)) |>
split(~ group + subgroup + type, drop = TRUE) |>
lapply(\(x) {
x[sample(nrow(x), size = unique(x$lowestN)), ]
}) |>
bind_rows() |>
arrange(group, subgroup, type)
#> group subgroup type value lowestN
#> 1 1 1 A 6 2
#> 2 1 1 A 70 2
#> 3 1 1 B 70 2
#> 4 1 1 B 51 2
#> 5 1 2 A 40 2
#> 6 1 2 A 65 2
#> 7 1 2 B 2 2
#> 8 1 2 B 87 2
#> 9 2 1 A 18 2
#> 10 2 1 A 75 2
#> 11 2 1 B 32 2
#> 12 2 1 B 42 2
#> 13 2 2 A 22 1
#> 14 2 2 B 81 1