I have data with a group variable, and I want to sample rows to end up with certain proportions in the group variable. This might require filtering rows, as the following example shows.
Simulating data
set.seed(2021)
my_df <-
data.frame(animal = sample(rep(c("dog", "cat", "rabbit"), times = c(150, 4100, 220))),
weight = sample(5:25, size = 4470, replace = TRUE))
> head(my_df)
## animal weight
## 1 cat 11
## 2 cat 24
## 3 cat 9
## 4 cat 20
## 5 cat 11
## 6 rabbit 9
Here we have data about 4470 animals, which could be either cat, dog, or rabbit, and the weight of each individual animal.
If we summarize the proportions of animals of each type we get:
library(dplyr)
my_df %>%
group_by(animal) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
## # A tibble: 3 x 3
## animal n freq
## * <chr> <int> <dbl>
## 1 cat 4100 0.917
## 2 dog 150 0.0336
## 3 rabbit 220 0.0492
We thus learned that in my_df
, 91.7% of the data are cats, 4.92% are rabbits, and 3.36% are dogs.
Desired Output: Sampling rows from the data to end up with other proportions over animal
column
I realized that my data in my_df
is not representative of the population I study, and therefore I want to sample rows to alter the proportions.
I want to end up with data that is comprised of 70% cats, 15% dogs, and 15% rabbits. Obviously, I'll need to throw away many of the cat
rows to reach such distribution.
Is there a simple way to reach such random sampling, to meet a desired proportion over a grouping variable?
EDIT
To clarify, in my_df
reaching the desired proportions between cat:dog:rabbit requires throwing away not only cats, but potentially dogs and rabbits too.
EDIT 2
In the comments, @Limey suggested this post, which is indeed relevant. However, I've tried applying this solution from there but it didn't give the expected output.
library(purrr)
group_slice_prop <- c(cat = 0.7, dog = 0.15, rabbit = 0.15)
output <-
my_df %>%
split(.$animal) %>%
imap_dfr(~ slice_sample(.x, prop = group_slice_prop[.y]))
We get that output
is:
output %>%
group_by(animal) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
## # A tibble: 3 x 3
## animal n freq
## * <chr> <int> <dbl>
## 1 cat 2870 0.981
## 2 dog 22 0.00752
## 3 rabbit 33 0.0113
But I was expecting to summarize output
and get:
# A tibble: 3 x 3
animal n freq
* <chr> <int> <dbl>
1 cat ? 0.70
2 dog ? 0.15
3 rabbit ? 0.15
EDIT 3
Both @AnilGoyal and @Chris Ruehlemann proposed solutions that do work in this case, but are somewhat limited to the toy data I provided. We could have thought about other scenarios with different, less intuitive proportions, or otherwise more levels in the group variable that require different math to figure out the n
per group. I want to avoid that. I wish to specify the desired blend of proportions in the group variable, and let the code decide how many rows to throw away from each group category to reach that blend.
EDITed answer in view of EDIT-3
#desired sample sizes
samp <- tibble(animal = c('cat', 'dog', 'rabbit'),
prop = c(0.70, 0.15, 0.15))
arrange(count(my_df, animal), n) %>% left_join(samp, by = "animal") %>%
mutate(n1 = first(n)/first(prop),
n = prop * n1) %>% select(-prop, -n1) %>%
right_join(my_df, by = "animal") %>%
group_split(animal) %>%
map_df(~sample_n(.x, size = first(n))) %>%
select(-n)
# A tibble: 1,000 x 2
animal weight
<chr> <int>
1 cat 19
2 cat 7
3 cat 17
4 cat 11
5 cat 22
6 cat 8
7 cat 22
8 cat 14
9 cat 22
10 cat 18
# ... with 990 more rows
Try this out on different df
set.seed(123)
my_df <-
data.frame(animal = sample(rep(c("dog", "cat", "rabbit"), times = c(1500, 4100, 220))),
weight = sample(5:25, size = 5820, replace = TRUE))
library(tidyverse)
samp <- tibble(animal = c('cat', 'dog', 'rabbit'),
prop = c(0.70, 0.15, 0.15))
arrange(count(my_df, animal), n) %>% left_join(samp, by = "animal") %>%
mutate(n1 = first(n)/first(prop),
n = prop * n1) %>% select(-prop, -n1) %>%
right_join(my_df, by = "animal") %>%
group_split(animal) %>%
map_df(~sample_n(.x, size = first(n))) %>%
select(-n) -> sampled
library(janitor)
tabyl(sampled$animal)
sampled$animal n percent
cat 1026 0.6998636
dog 220 0.1500682
rabbit 220 0.1500682