This question is a follow up question to this one.
The following code creates a tibble that includes the empirical distributions of the iris
dataset columns Sepal.Length
, Sepal.Width
, Petal.Length
and Petal.Width
separately for the groups defined by the column Species
(i.e., the values setosa
, versicolor
, and virginica
).
# R 4.2
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
distributions <- split(iris[-5], iris$Species) |>
purrr::map(\(x) purrr::map(x, \(y) dplyr::count(x, {{y}}) |> mutate(n = proportions(n)))) |>
tibble::enframe(name = "Species") |>
tidyr::unnest_wider(value)
My goal is to use the tibble distributions
as a "lookup table" to randomly draw values from these empirical distributions for the shortened iris
dataset produced by the following code.
iris_new <- iris %>%
select(Species)
Thus, I would like to randomly draw from distributions
to create new "randomized" columns Sepal.Length
, Sepal.Width
, Petal.Length
and Petal.Width
incorporating the appropriate empirical distribution with respect to the value of the column Species
.
In a dataset without groups, I would simply use base::sample
. How can I achieve my goal in the presence of grouped data?
I first made a function that will make the distribution. To me, this is a bit cleaner than putting all the code in a call to map()
or something like that. The function make_dist()
makes a tibble with values and the probability with which that value occurs.
library(dplyr)
make_dist <- function(x){
x <- table(x)
x <- x/sum(x)
x <- tibble::tibble(vals = as.numeric(names(x)),
pct = c(unname(x)))
list(x)
}
Next, I'll make iris_new
- a tibble with just Species
where the group sizes are different from iris
.
data(iris)
iris_new <- tibble(Species = rep(c("setosa", "versicolor",
"virginica"), c(100, 74, 103)))
Next, make the distribution of each variable by species and save as iris_dist
.
iris_dist <- iris %>%
group_by(Species) %>%
reframe(across(everything(), ~make_dist(.x)))
The following should work whether or not iris_new
has other variables in it beside Species
. First the code counts the number of observations by Species
, then joins in the distributions. Next, we use the sample()
function from base
with the values and proportions that were in iris_dist
. Finally, hook the result back up with the original iris_new
with bind_cols()
.
iris_new %>%
group_by(Species) %>%
tally() %>%
left_join(iris_dist) %>%
group_by(Species) %>%
reframe(across(-n, ~sample(.x[[1]]$vals, n, replace=TRUE, prob=.x[[1]]$pct))) %>%
select(-Species) %>%
bind_cols(iris_new, .)
#> Joining with `by = join_by(Species)`
#> # A tibble: 277 × 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5 3.2 1.5 0.1
#> 2 setosa 5.1 3.4 1.6 0.3
#> 3 setosa 5.1 3.8 1.9 0.2
#> 4 setosa 4.6 3.1 1.6 0.1
#> 5 setosa 4.6 3.7 1.5 0.5
#> 6 setosa 4.8 4.1 1.6 0.2
#> 7 setosa 5.4 3 1.1 0.2
#> 8 setosa 5.7 3.1 1.3 0.2
#> 9 setosa 4.6 4.2 1.4 0.2
#> 10 setosa 4.8 3.5 1.3 0.5
#> # ℹ 267 more rows
Created on 2023-10-25 with reprex v2.0.2