Search code examples
rdplyrpurrr

Using R, how do we split a dataframe to create a dynamic multi-tabbed excel file for each customer in the dataframe with a tab for each product


Given this dataframe:

library(tidyverse)
library(openxlsx)

df.1 <- tribble(
  ~customer  ,~period, ~cost1, ~cost2 , ~prod,
  'cust1',  '202201', 5, 10, 'online',
  'cust1',  '202202', 5, 10, 'online',
  'cust1',  '202203', 5, 10, 'in-person',
  'cust1',  '202204', 5, 10, 'in-person',
  'cust2',  '202203', 5, 10,'online',
  'cust2',  '202204', 5, 10, 'in-person',
  'cust2',  '202202', 5, 10,'online',
  'cust3',  '202204', 5, 10,'online'
 'cust4',  '202101', NA, NA, 'online',
  'cust4',  '202102', NA,10, 'online'
  
)

I am trying to create an excel file for each customer and dynamically add tabs based on the variable prod, where some files may have 1 tab and others more than 1 based upon how many products they have.

I would also like to remove any column where all values are NA so in this simple example, the file for cust4 would not have a column cost1.

Here is what I tried. It seems to split the file by customer and product, but it doesn't- create a file for each customer and it doesn't remove the column where all values are NA

list(tab1 = df.1) %>%
  map(\(df) split(df, list(df.1$customer, df.1$prod), drop =T)) %>%
  janitor::remove_empty(., which = 'cols') %>% 
  iwalk(\(df, nm) write.xlsx(df, paste0(nm, ".xlsx"), startRow = 6))

How can I accomplish this? I would expect 4 files with a tab for each product they actually have, and in cust4's case only 4 columns instead of 5


Solution

  • Here is one approach to achieve your desired result, which

    1. splits the dataset by customer
    2. uses iwalk to loop over the list of customer dataframes
    3. removes empty columns
    4. splits the customer df by prod
    5. exports the resulting list of product dataframes
    library(tidyverse)
    library(openxlsx)
    
    df.1 %>% 
      split(.$customer) %>%
      iwalk(\(df, nm) {
        df %>%
          janitor::remove_empty(which = 'cols') %>% 
          split(.$prod) |> 
          write.xlsx(paste0(nm, ".xlsx"), startRow = 6)
      })
    
    read.xlsx("cust4.xlsx", "online")
    #>   customer period cost2   prod
    #> 1    cust4 202101    NA online
    #> 2    cust4 202102    10 online
    

    And thanks to the comment by @MrFlick we could achieve the same result using the base pipe |> like so:

    df.1 |>  
      split(~customer) %>%
      iwalk(\(df, nm) {
        df |> 
          janitor::remove_empty(which = 'cols') |> 
          split(~prod) |> 
          write.xlsx(paste0(nm, ".xlsx"), startRow = 6)
      })
    

    EDIT You could remove empty cols per product by first splitting by prod and by using map to get rid of the empty cols like so:

    df.1 |>  
      split(~customer) %>%
      iwalk(\(df, nm) {
        df |> 
          split(~prod) |> 
          map(~janitor::remove_empty(.x, which = 'cols')) |>
          write.xlsx(paste0(nm, ".xlsx"), startRow = 6)
      })