Search code examples
rnestedtidyversedata-cleaning

How to un-nest dataframe items separated by `;`?


I have an un-tidy data frame saved as a tidyverse Tibble called data like this:

# A tibble: 4 × 3
  `Full name` `Favorite foods`                            `Preferred colors`
  <chr>       <chr>                                       <chr>             
1 Homer       key lime pie; celery; fried rice            green; red        
2 Marge       celery; ice cream                           NA                
3 Mr. Burns   fried rice; apple; fried chicken; ice cream orange; purple    
4 Krusty      celery; key lime pie; apple                 red; blue

It can be recreated like this:

data <- tribble(
    ~"Full name", ~"Favorite foods", ~"Preferred colors", 
    "Homer", "key lime pie; celery; fried rice", "green; red", 
    "Marge", "celery; ice cream", NA, 
    "Mr. Burns", "fried rice; apple; fried chicken; ice cream", "orange; purple", 
    "Krusty", "celery; key lime pie; apple", "red; blue"
)

As you can see, it lists people's Favorite foods and Preferred colors, where each element is a chr within which items are separated by ;, such as celery; ice cream.

Ultimately, I would like to separately analyze and visualize people's favorite foods and colors. Such as by being able to say, "The top two most favorite foods are x and y"; or "No one prefers black".

How do I tidy this data so that it reshapes/processes the foods and colors into a tidy data frame? For example, could the food and color columns be converted into their own nested data frames? Or is there a better/simpler way to go?

A tidyverse solution is not required but strongly preferred because that's what I am most familiar with. Computational performance is nice but not a strong consideration in this case.

EDIT: For future-proofing, I might also have columns in this data that I do not want to split.


Solution

  • You can use str_split to create list-columns:

    library(dplyr)
    library(stringr)
    
    data %>% 
      mutate(across(-`Full name`, ~ str_split(.x, pattern = "; ")))
    
    # A tibble: 4 × 3
      `Full name` `Favorite foods` `Preferred colors`
      <chr>       <list>           <list>            
    1 Homer       <chr [3]>        <chr [2]>         
    2 Marge       <chr [2]>        <chr [1]>         
    3 Mr. Burns   <chr [4]>        <chr [2]>         
    4 Krusty      <chr [3]>        <chr [2]>         
    

    To make it fully tidy, I'd recommend pivoting:

    data %>%
      mutate(across(-1, ~ str_split(.x, pattern = "; "))) %>% 
      pivot_longer(-1) %>% 
      unnest_longer("value")
    
    # A tibble: 19 × 3
       `Full name` name             value        
       <chr>       <chr>            <chr>        
     1 Homer       Favorite foods   key lime pie 
     2 Homer       Favorite foods   celery       
     3 Homer       Favorite foods   fried rice   
     4 Homer       Preferred colors green        
     5 Homer       Preferred colors red          
     6 Marge       Favorite foods   celery       
     7 Marge       Favorite foods   ice cream    
     8 Marge       Preferred colors NA           
     9 Mr. Burns   Favorite foods   fried rice   
    10 Mr. Burns   Favorite foods   apple        
    11 Mr. Burns   Favorite foods   fried chicken
    12 Mr. Burns   Favorite foods   ice cream    
    13 Mr. Burns   Preferred colors orange       
    14 Mr. Burns   Preferred colors purple       
    15 Krusty      Favorite foods   celery       
    16 Krusty      Favorite foods   key lime pie 
    17 Krusty      Favorite foods   apple        
    18 Krusty      Preferred colors red          
    19 Krusty      Preferred colors blue