Search code examples
rtidyversesampling

R: sample equal Ns, within groups


I have a data set, in which observations are divided into groups and subgroups, and into types. Here's an example of such data:

data <- data.frame(group    = sample(c(1:30),    10000, replace = TRUE),
                   subgroup = sample(c(1:100),   10000, replace = TRUE),
                   type    = sample(c("A", "B"), 10000, replace = TRUE),
                   value   = sample(c(1:100),    10000, replace = TRUE))

I need to sample equal Ns of each type, within each combination of group and subgroup.

In this example, as well as in the actual data, many combinations of group+subgroup+type are empty. Such combinations for which N = 0 are to be dropped.

My solution is as follows:


First, I find the N to be sampled for each group & subgroup. I count how many As and Bs are within each combination and choose the smaller number:

library(tidyverse)

n_to_sample <- data %>%
  group_by(group, subgroup, type) %>% 
  summarise(n = n()) %>%
  pivot_wider(id_cols = c("group", "subgroup"),
              names_from =  "type",
              values_from = "n") %>%
  rowwise() %>%
  mutate(lowestN = min(A, B)) %>%
  ungroup() %>% 
  filter(!is.na(lowestN)) %>%
  select (group, subgroup, lowestN)

Second, I sample observations, according to the Ns counted earlier:

data_sampled <- data %>%
  left_join(n_to_sample, by = c("group", "subgroup")) %>% 
  filter(!is.na(lowestN)) %>%
  arrange(group, type) %>%
  filter(row_number() %in% c(sample(which(type == "A"), mean(lowestN)),
                             sample(which(type == "B"), mean(lowestN))),
         .by = c("group", "subgroup")) 

To verify the procedure, I check that how many observations are left whithin each combination of group+subgroup+type:

data %>%
  count(group, type, subgroup) %>%
  arrange(group, subgroup, type) %>%
  pivot_wider(id_cols = c("group", "subgroup"),
              names_from =  "type",
              values_from = "n")

However, this check results in many NAs in the count variables. So my solution was not successful. Where did I go wrong? What can I do to correct it?

(NOTE: I'm not so interested in efficiency, which could probably be better. I just want it to work properly)


Solution

  • Here is one possible approach which first computes the lowestN per group, subgroup and type, then splits the data by these variables, draws the sample and binds by row.

    Note: I stripped down your example data to make it more minimal.

    set.seed(1)
    
    n <- 20
    data <- data.frame(
      group = sample(c(1:2), n, replace = TRUE),
      subgroup = sample(c(1:2), n, replace = TRUE),
      type = sample(c("A", "B"), n, replace = TRUE),
      value = sample(c(1:100), n, replace = TRUE)
    )
    
    library(dplyr, warn = FALSE)
    
    data |>
      add_count(group, subgroup, type, name = "lowestN") |>
      mutate(
        lowestN = if (length(unique(type)) == 2) min(lowestN) else NA,
        .by = c(group, subgroup)
      ) |>
      filter(!is.na(lowestN)) |> 
      split(~ group + subgroup + type, drop = TRUE) |>
      lapply(\(x) {
        x[sample(nrow(x), size = unique(x$lowestN)), ]
      }) |>
      bind_rows() |> 
      arrange(group, subgroup, type)
    #>    group subgroup type value lowestN
    #> 1      1        1    A     6       2
    #> 2      1        1    A    70       2
    #> 3      1        1    B    70       2
    #> 4      1        1    B    51       2
    #> 5      1        2    A    40       2
    #> 6      1        2    A    65       2
    #> 7      1        2    B     2       2
    #> 8      1        2    B    87       2
    #> 9      2        1    A    18       2
    #> 10     2        1    A    75       2
    #> 11     2        1    B    32       2
    #> 12     2        1    B    42       2
    #> 13     2        2    A    22       1
    #> 14     2        2    B    81       1