The goal is to create multiple list columns from the data columns of a nested data frame. The following code achieves that, but it is quite long, and I wonder whether it can be shortened with tidyverse tools (dplyr, purrr, etc.). In a non-nested data frame I would use, e.g., dplyr's across().
# R version 3.6.1
library(dplyr) # 1.0.7
library(tidyr) # 1.2.0
df_distribution <- iris %>%
  dplyr::group_by(Species) %>%
  tidyr::nest() %>%
  dplyr::mutate(Sepal.Length = purrr::map(data, ~ dplyr::select(.x, Sepal.Length) %>%
    dplyr::group_by(Sepal.Length) %>%
    dplyr::summarise(n = n()) %>%
    dplyr::mutate(perc = n / sum(n)) %>%
    dplyr::select(-n))) %>%
  dplyr::mutate(Sepal.Width = purrr::map(data, ~ dplyr::select(.x, Sepal.Width) %>%
    dplyr::group_by(Sepal.Width) %>%
    dplyr::summarise(n = n()) %>%
    dplyr::mutate(perc = n / sum(n)) %>%
    dplyr::select(-n))) %>%
  dplyr::mutate(Petal.Length = purrr::map(data, ~ dplyr::select(.x, Petal.Length) %>%
    dplyr::group_by(Petal.Length) %>%
    dplyr::summarise(n = n()) %>%
    dplyr::mutate(perc = n / sum(n)) %>%
    dplyr::select(-n))) %>%
  dplyr::mutate(Petal.Width = purrr::map(data, ~ dplyr::select(.x, Petal.Width) %>%
    dplyr::group_by(Petal.Width) %>%
    dplyr::summarise(n = n()) %>%
    dplyr::mutate(perc = n / sum(n)) %>%
    dplyr::select(-n)))
My ultimate goal is to draw random samples from the resulting empirical distributions. That step is not part of the code above, but I would appreciate any pointers to helpful resources for it, too.
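Combining dplyr::count with proportions in a double purrr::map, and then tibble::enframe and tidyr::unnest_wider to get the list-column format, you can do this. (The native pipe |> and the \(x) lambda shorthand need R 4.1 or later, and proportions() needs R 4.0 or later.)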
sol <- split(iris[-5], iris$Species) |>
  purrr::map(\(x) purrr::map(x, \(y) dplyr::count(x, {{y}}) |> dplyr::mutate(n = proportions(n)))) |>
  tibble::enframe(name = "Species") |>
  tidyr::unnest_wider(value)
sol
# # A tibble: 3 × 5
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# <chr> <list> <list> <list> <list>
# 1 setosa <df [15 × 2]> <df [16 × 2]> <df [9 × 2]> <df [6 × 2]>
# 2 versicolor <df [21 × 2]> <df [14 × 2]> <df [19 × 2]> <df [9 × 2]>
# 3 virginica <df [21 × 2]> <df [13 × 2]> <df [20 × 2]> <df [12 × 2]>
Apart from some differences in column names, this gives the correct answer in the correct format:
identical(df_distribution$Sepal.Length[[1]]$perc, sol$Sepal.Length[[1]]$n)
#[1] TRUE
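As for drawing from the empirical distributions, here is a minimal sketch in base R. It assumes each distribution is a data frame with the observed values in the first column and their proportions in the second, as in the list columns above. The helper name draw_from is hypothetical, and the example rebuilds one distribution with table() so that it runs on its own.

```r
# Hypothetical helper: draw `size` values from an empirical distribution.
# Assumes `dist` holds the values in column 1 and their proportions in column 2.
draw_from <- function(dist, size) {
  values <- dist[[1]]
  probs <- dist[[2]]
  # Sample row indices rather than the values themselves: sample(x) treats a
  # single number x as 1:x, which would misbehave for a one-row distribution.
  idx <- sample.int(length(values), size = size, replace = TRUE, prob = probs)
  values[idx]
}

# Rebuild one distribution (Sepal.Length for setosa) so the sketch is self-contained.
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
counts <- table(setosa)
dist <- data.frame(
  Sepal.Length = as.numeric(names(counts)),
  perc = as.numeric(counts) / length(setosa)
)

set.seed(42) # only for reproducibility
draws <- draw_from(dist, size = 1000)
```

With the nested result from above, the same helper could be mapped over a list column, for example purrr::map(sol$Sepal.Length, draw_from, size = 100), which should return one vector of draws per species.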