
Create multiple list columns from data columns of a nested data frame


The goal is to create multiple list columns from the data columns of a nested data frame. The following code achieves that, but it is long and repetitive, and I wonder whether it can be shortened with tidyverse tools (dplyr, purrr, etc.). In a non-nested data frame I would use, e.g., dplyr's across().

# R version 3.6.1

library(dplyr) # 1.0.7
library(tidyr) # 1.2.0


df_distribution <- iris %>% 
  dplyr::group_by(Species) %>% 
  tidyr::nest() %>% 
  dplyr::mutate(Sepal.Length = purrr::map(data, ~ dplyr::select(.x, Sepal.Length) %>% 
                                            dplyr::group_by(Sepal.Length) %>% 
                                            dplyr::summarise(n = n() ) %>% 
                                            dplyr::mutate(perc = n / sum(n) ) %>% 
                                            dplyr::select(-n) ) ) %>% 
  dplyr::mutate(Sepal.Width  = purrr::map(data, ~ dplyr::select(.x, Sepal.Width) %>% 
                                            dplyr::group_by(Sepal.Width) %>% 
                                            dplyr::summarise(n = n() ) %>% 
                                            dplyr::mutate(perc = n / sum(n) ) %>% 
                                            dplyr::select(-n) ) ) %>% 
  dplyr::mutate(Petal.Length = purrr::map(data, ~ dplyr::select(.x, Petal.Length) %>% 
                                            dplyr::group_by(Petal.Length) %>% 
                                            dplyr::summarise(n = n() ) %>% 
                                            dplyr::mutate(perc = n / sum(n) ) %>% 
                                            dplyr::select(-n) ) ) %>% 
  dplyr::mutate(Petal.Width  = purrr::map(data, ~ dplyr::select(.x, Petal.Width) %>% 
                                            dplyr::group_by(Petal.Width) %>% 
                                            dplyr::summarise(n = n() ) %>% 
                                            dplyr::mutate(perc = n / sum(n) ) %>% 
                                            dplyr::select(-n) ) )

My ultimate goal is to draw randomly from the empirical distributions created this way. That step is not part of the code above, but I would appreciate any pointers to helpful resources for it, too.
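
As a pointer for the sampling step: base R's sample() accepts a prob argument, so one cell of a list column (a table of values and their proportions) can be used directly as a sampling distribution. The following is a minimal sketch under that assumption; draw_from() and dist_setosa are hypothetical names used for illustration only.

```r
library(dplyr)

# Empirical distribution of Sepal.Length for one species,
# in the same shape as one cell of the list columns above
dist_setosa <- iris %>%
  filter(Species == "setosa") %>%
  count(Sepal.Length) %>%
  mutate(perc = n / sum(n)) %>%
  select(-n)

# Hypothetical helper: draw `size` values from a two-column table
# (first column = observed values, `perc` = their relative frequencies)
draw_from <- function(dist, size) {
  sample(dist[[1]], size = size, replace = TRUE, prob = dist$perc)
}

set.seed(1)  # only to make the draw reproducible
draws <- draw_from(dist_setosa, 1000)
```

One caveat: when the first argument of sample() is a single number, it samples from 1:x instead of returning that number, so a distribution with only one distinct value would need special handling.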


Solution

  • You can combine dplyr::count() with proportions() inside a nested purrr::map(), then use tibble::enframe() and tidyr::unnest_wider() to get the list-column format. Note that the |> pipe and the \(x) lambda syntax require R >= 4.1:

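    # For each species, count every measurement column and turn the counts into proportions
    sol <- split(iris[-5], iris$Species) |> 
      purrr::map(\(x) purrr::map(x, \(y) dplyr::count(x, {{y}}) |> dplyr::mutate(n = proportions(n)))) |> 
      tibble::enframe(name = "Species") |> 
      tidyr::unnest_wider(value)
    sol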
    
    # # A tibble: 3 × 5
    #   Species    Sepal.Length  Sepal.Width   Petal.Length  Petal.Width  
    #   <chr>      <list>        <list>        <list>        <list>       
    # 1 setosa     <df [15 × 2]> <df [16 × 2]> <df [9 × 2]>  <df [6 × 2]> 
    # 2 versicolor <df [21 × 2]> <df [14 × 2]> <df [19 × 2]> <df [9 × 2]> 
    # 3 virginica  <df [21 × 2]> <df [13 × 2]> <df [20 × 2]> <df [12 × 2]>
    

    Apart from some column-name differences (for example, the proportion column is named n here rather than perc), this gives the same values in the same format. With the result stored as sol:

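    # Compare one cell (setosa); `$n` on the list column itself would be NULL on both sides
    all.equal(df_distribution$Sepal.Length[[1]]$perc, sol$Sepal.Length[[1]]$n)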
    #[1] TRUE
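
If the original column names matter (the measurement name for the values and perc for the proportions), one possible variant is to refer to each column by its name through dplyr's .data pronoun. This is a sketch rather than the answerer's method; rel_freq(), measure_cols, and df_short are hypothetical names, and the %>% pipe is used so that it also runs on R versions before 4.1.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper: relative frequencies of one column, given its name as a string
rel_freq <- function(data, col) {
  data %>%
    count(.data[[col]]) %>%       # the count column keeps the name stored in `col`
    mutate(perc = n / sum(n)) %>%
    select(-n)
}

# All measurement columns, i.e. everything except the grouping column
measure_cols <- setdiff(names(iris), "Species")

# One row per species, with the remaining columns nested into `data`
df_short <- iris %>%
  nest(data = -Species)

# Add one list column per measurement column
for (col in measure_cols) {
  df_short[[col]] <- map(df_short$data, rel_freq, col = col)
}
```

Unlike df_distribution, df_short is not grouped by Species; adding group_by(Species) afterwards would restore that if it is needed.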