Search code examples
rdplyrtidyr

Randomly draw values from an empirical distribution by group for multiple columns


This question is a follow up question to this one.

The following code creates a tibble that includes the empirical distributions of the iris dataset columns Sepal.Length, Sepal.Width, Petal.Length and Petal.Width separately for the groups defined by the column Species (i.e., the values setosa, versicolor, and virginica).

# R 4.2
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)

distributions <- split(iris[-5], iris$Species) |> 
  purrr::map(\(x) purrr::map(x, \(y) dplyr::count(x, {{y}}) |> mutate(n = proportions(n)))) |> 
  tibble::enframe(name = "Species") |> 
  tidyr::unnest_wider(value)

My goal is to use the tibble distributions as a "lookup table" to randomly draw values from these empirical distributions for the shortened iris dataset produced by the following code.

iris_new <- iris %>% 
  select(Species)

Thus, I would like to randomly draw from distributions to create new "randomized" columns Sepal.Length, Sepal.Width, Petal.Length and Petal.Width incorporating the appropriate empirical distribution with respect to the value of the column Species.

In a dataset without groups, I would simply use base::sample. How can I achieve my goal in the presence of grouped data?


Solution

  • I first made a function that will make the distribution. To me, this is a bit cleaner than putting all the code in a call to map() or something like that. The function make_dist() makes a tibble with values and the probability with which that value occurs.

    library(dplyr)
    make_dist <- function(x){
      x <- table(x)
      x <- x/sum(x)
      x <- tibble::tibble(vals = as.numeric(names(x)), 
                      pct = c(unname(x)))
      list(x)
    }
    

    Next, I'll make iris_new - a tibble with just Species where the group sizes are different from iris.

    data(iris)
    iris_new <- tibble(Species = rep(c("setosa", "versicolor", 
                      "virginica"), c(100, 74, 103)))
    

    Next, make the distribution of each variable by species and save as iris_dist.

    iris_dist <- iris %>% 
      group_by(Species) %>% 
      reframe(across(everything(), ~make_dist(.x))) 
    

    The following should work whether or not iris_new has other variables in it beside Species. First the code counts the number of observations by Species, then joins in the distributions. Next, we use the sample() function from base with the values and proportions that were in iris_dist. Finally, hook the result back up with the original iris_new with bind_cols().

    iris_new %>% 
      group_by(Species) %>% 
      tally() %>% 
      left_join(iris_dist) %>% 
      group_by(Species) %>% 
      reframe(across(-n, ~sample(.x[[1]]$vals, n, replace=TRUE, prob=.x[[1]]$pct))) %>% 
      select(-Species) %>% 
      bind_cols(iris_new, .)
    #> Joining with `by = join_by(Species)`
    #> # A tibble: 277 × 5
    #>    Species Sepal.Length Sepal.Width Petal.Length Petal.Width
    #>    <chr>          <dbl>       <dbl>        <dbl>       <dbl>
    #>  1 setosa           5           3.2          1.5         0.1
    #>  2 setosa           5.1         3.4          1.6         0.3
    #>  3 setosa           5.1         3.8          1.9         0.2
    #>  4 setosa           4.6         3.1          1.6         0.1
    #>  5 setosa           4.6         3.7          1.5         0.5
    #>  6 setosa           4.8         4.1          1.6         0.2
    #>  7 setosa           5.4         3            1.1         0.2
    #>  8 setosa           5.7         3.1          1.3         0.2
    #>  9 setosa           4.6         4.2          1.4         0.2
    #> 10 setosa           4.8         3.5          1.3         0.5
    #> # ℹ 267 more rows
    

    Created on 2023-10-25 with reprex v2.0.2