Search code examples
rdataframefilterdplyrsample

How to randomly filter rows to achieve desired proportions of a grouping variable


I have data with a group variable, and I want to sample rows to end up with certain proportions in the group variable. This might require filtering rows, as the following example shows.

Simulating data

set.seed(2021)

my_df <-
  data.frame(animal = sample(rep(c("dog", "cat", "rabbit"), times = c(150, 4100, 220))),
           weight = sample(5:25, size = 4470, replace = TRUE))

> head(my_df)
##   animal weight
## 1    cat     11
## 2    cat     24
## 3    cat      9
## 4    cat     20
## 5    cat     11
## 6 rabbit      9

Here we have data about 4470 animals, which could be either cat, dog, or rabbit, and the weight of each individual animal.

If we summarize the proportions of animals of each type we get:

library(dplyr)

my_df %>%
  group_by(animal) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))

## # A tibble: 3 x 3
##   animal     n   freq
## * <chr>  <int>  <dbl>
## 1 cat     4100 0.917 
## 2 dog      150 0.0336
## 3 rabbit   220 0.0492

We thus learned that in my_df, 91.7% of the data are cats, 4.92% are rabbits, and 3.36% are dogs.


Desired Output: Sampling rows from the data to end up with other proportions over animal column
I realized that my data in my_df is not representative of the population I study, and therefore I want to sample rows to alter the proportions.

I want to end up with data that is comprised of 70% cats, 15% dogs, and 15% rabbits. Obviously, I'll need to throw away many of the cat rows to reach such distribution.

Is there a simple way to reach such random sampling, to meet a desired proportion over a grouping variable?


EDIT


To clarify, in my_df reaching the desired proportions between cat:dog:rabbit requires throwing away not only cats, but potentially dogs and rabbits too.


EDIT 2


In the comments, @Limey suggested this post, which is indeed relevant. However, I've tried applying this solution from there but it didn't give the expected output.

library(purrr)

group_slice_prop <- c(cat = 0.7, dog = 0.15, rabbit = 0.15)

output <-
  my_df %>% 
  split(.$animal) %>% 
  imap_dfr(~ slice_sample(.x, prop = group_slice_prop[.y]))

We get that output is:

output %>%
  group_by(animal) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))

## # A tibble: 3 x 3
##   animal     n    freq
## * <chr>  <int>   <dbl>
## 1 cat     2870 0.981  
## 2 dog       22 0.00752
## 3 rabbit    33 0.0113 

But I was expecting to summarize output and get:

# A tibble: 3 x 3
  animal     n    freq
* <chr>  <int>   <dbl>
1 cat       ?     0.70  
2 dog       ?     0.15
3 rabbit    ?     0.15 

EDIT 3


Both @AnilGoyal and @Chris Ruehlemann proposed solutions that do work in this case, but are somewhat limited to the toy data I provided. We could have thought about other scenarios with different, less intuitive proportions, or otherwise more levels in the group variable that require different math to figure out the n per group. I want to avoid that. I wish to specify the desired blend of proportions in the group variable, and let the code decide how many rows to throw away from each group category to reach that blend.


Solution

  • EDITed answer in view of EDIT-3

    #desired sample sizes
    samp <- tibble(animal = c('cat', 'dog', 'rabbit'), 
           prop = c(0.70, 0.15, 0.15)) 
    
    arrange(count(my_df, animal), n) %>% left_join(samp, by = "animal") %>%
      mutate(n1 = first(n)/first(prop),
             n = prop * n1) %>% select(-prop, -n1) %>%
      right_join(my_df, by = "animal") %>%
      group_split(animal) %>%
      map_df(~sample_n(.x, size = first(n))) %>%
      select(-n)
    # A tibble: 1,000 x 2
       animal weight
       <chr>   <int>
     1 cat        19
     2 cat         7
     3 cat        17
     4 cat        11
     5 cat        22
     6 cat         8
     7 cat        22
     8 cat        14
     9 cat        22
    10 cat        18
    # ... with 990 more rows
    
    

    Try this out on different df

    set.seed(123)
    my_df <-
      data.frame(animal = sample(rep(c("dog", "cat", "rabbit"), times = c(1500, 4100, 220))),
                 weight = sample(5:25, size = 5820, replace = TRUE))
    
    library(tidyverse)
    samp <- tibble(animal = c('cat', 'dog', 'rabbit'), 
           prop = c(0.70, 0.15, 0.15)) 
    
    arrange(count(my_df, animal), n) %>% left_join(samp, by = "animal") %>%
      mutate(n1 = first(n)/first(prop),
             n = prop * n1) %>% select(-prop, -n1) %>%
      right_join(my_df, by = "animal") %>%
      group_split(animal) %>%
      map_df(~sample_n(.x, size = first(n))) %>%
      select(-n) -> sampled
    
    library(janitor)  
    tabyl(sampled$animal)
    
     sampled$animal    n   percent
                cat 1026 0.6998636
                dog  220 0.1500682
             rabbit  220 0.1500682