How do I sample specific sizes within groups?


I want to draw a sample of an exact size from within each group. What method should I use to construct subsets whose sizes are based on the group counts?

My use case is a two-stage sample design. First, for each group in my population, I want to ensure that 60% of subjects will not be selected, so I am trying to construct a sampling data frame that excludes 60% of the available subjects in each group. This sits inside a function where the user specifies the minimum proportion of subjects that must not be used, hence the 1 - .6 construction: here the user has indicated that at least 60% of subjects in each group cannot be selected for sampling.

After this step, I will sample completely at random to get my final sample.

Code example:

library(dplyr)

testing <- data.frame(ID = seq_len(50), Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16)))

testing <- testing %>%
  slice_sample(ID, prop = 1 - .6)

As you can see, the numbers by group are not what I want. I should have only 4 subjects who are 18 years of age, 3 who are 19, 6 who are 20, and 6 who are 21. With no seed set, I ended up with six 18-year-olds, one 19-year-old, six 20-year-olds, and seven 21-year-olds.

However, the overall sample size of 20 is correct.
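The target counts listed above can be reproduced with a little base-R arithmetic. This is only a sketch: the name `min_excluded` is made up for illustration, and rounding down with `floor()` is an assumption chosen because it matches the counts given in the question. Because each group is rounded down separately, the per-group targets add up to 19 rather than 20.

```r
# Toy data from the question: 50 subjects across four ages.
testing <- data.frame(
  ID  = seq_len(50),
  Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16))
)

min_excluded <- 0.6                 # hypothetical name: share that must not be used
group_sizes  <- table(testing$Age)  # rows available per age

# Rounding down keeps at least 60% of every group out of the sample.
target_n <- floor(group_sizes * (1 - min_excluded))
target_n
sum(target_n)
```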

How do I brute force the sample size within the groups to be what I need?

There are other variables in the data frame so I need to sample randomly from each age group.

EDIT: I messed up while trying to give an example. In my real data I am grouping by age inside the dplyr pipeline. But neither calling group_by([Age variable]) ahead of slice_sample() nor doing the grouping inside slice_sample() works. In my real data, I get neither the correct set of samples by age nor the correct overall sample size.

I was using a semi_join to limit the ages to those that had a total remaining after doing the proportion test. For those ages for which no sample could be taken, the semi_join was being used to remove those ages from the population ahead of doing the proportional sampling. I don't know if the semi_join has caused the problem.

That said, the answer provided and accepted shifts me away from relying on the semi_join and is, I think, a large overall improvement to my real code.


Solution

  • You haven't defined your grouping variable, so the proportion is applied to the whole data frame rather than to each age group.

    Try the following:

    library(dplyr)
    set.seed(1)
    # Group first so that prop is applied within each Age group
    x <- testing %>% group_by(Age) %>% slice_sample(prop = .4)
    x %>% count()
    # # A tibble: 4 x 2
    # # Groups:   Age [4]
    #     Age     n
    #   <dbl> <int>
    # 1    18     4
    # 2    19     3
    # 3    20     6
    # 4    21     6
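Since the question mentions wrapping this in a function that takes the minimum proportion to exclude, here is one way it might look. This is a sketch, not the asker's actual code: `sample_within_groups`, `min_excluded`, and the second-stage size of 10 are all hypothetical, and it assumes dplyr 1.0.0 or later for `slice_sample()` and `across()`.

```r
library(dplyr)

# Toy data from the question.
testing <- data.frame(
  ID  = seq_len(50),
  Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16))
)

# Hypothetical helper: keep at most (1 - min_excluded) of each group.
sample_within_groups <- function(data, group_col, min_excluded) {
  data %>%
    group_by(across(all_of(group_col))) %>%   # group by a column given as a string
    slice_sample(prop = 1 - min_excluded) %>% # proportion applied per group
    ungroup()
}

set.seed(1)
stage_one <- sample_within_groups(testing, "Age", min_excluded = 0.6)
count(stage_one, Age)

# Second stage: simple random sample from the stage-one frame
# (n = 10 is an arbitrary illustration, not a value from the question).
final_sample <- slice_sample(stage_one, n = 10)
```

A group too small to yield any rows at the requested proportion simply contributes nothing, so filtering such groups out beforehand may not be needed.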
    

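    Alternatively, try stratified() from my "splitstackshape" package. Note that the 19-year-old group gets 4 rows here rather than 3, which suggests that 9 × 0.4 = 3.6 is rounded to the nearest integer instead of rounded down: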

    library(splitstackshape)
    set.seed(1)
    y <- stratified(testing, "Age", .4)
    y[, .N, Age]
    #    Age N
    # 1:  18 4
    # 2:  19 4
    # 3:  20 6
    # 4:  21 6
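If the rounding rule matters, as it does when at least 60% of each group must stay unused, it can be made explicit. The following base-R sketch is not how either package is implemented; `sample_by_group` and its `round_fun` argument are hypothetical names used only for illustration.

```r
# Toy data from the question.
testing <- data.frame(
  ID  = seq_len(50),
  Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16))
)

# Hypothetical helper: sample within each group with an explicit rounding rule.
sample_by_group <- function(data, group_col, prop, round_fun = floor) {
  pieces <- split(data, data[[group_col]])
  sampled <- lapply(pieces, function(g) {
    k <- round_fun(nrow(g) * prop)                  # rows to keep from this group
    g[sample.int(nrow(g), size = k), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

set.seed(1)
floor_counts <- table(sample_by_group(testing, "Age", 0.4, round_fun = floor)$Age)
round_counts <- table(sample_by_group(testing, "Age", 0.4, round_fun = round)$Age)
floor_counts  # age 19 contributes 3 rows
round_counts  # age 19 contributes 4 rows
```

With `round`, the age-19 group keeps 4 of 9 subjects (about 44%), so fewer than 60% are excluded; `floor` is the safer choice when the exclusion share is a hard minimum.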