I have an un-tidy data frame saved as a tidyverse Tibble called data
like this:
# A tibble: 4 × 3
`Full name` `Favorite foods` `Preferred colors`
<chr> <chr> <chr>
1 Homer key lime pie; celery; fried rice green; red
2 Marge celery; ice cream NA
3 Mr. Burns fried rice; apple; fried chicken; ice cream orange; purple
4 Krusty celery; key lime pie; apple red; blue
It can be recreated like this:
data <- tribble(
~"Full name", ~"Favorite foods", ~"Preferred colors",
"Homer", "key lime pie; celery; fried rice", "green; red",
"Marge", "celery; ice cream", NA,
"Mr. Burns", "fried rice; apple; fried chicken; ice cream", "orange; purple",
"Krusty", "celery; key lime pie; apple", "red; blue"
)
As you can see, it lists people's Favorite foods
and Preferred colors
, where each element is a chr
within which items are separated by ;
, such as celery; ice cream
.
Ultimately, I would like to separately analyze and visualize people's favorite foods and colors. Such as by being able to say, "The top two most favorite foods are x and y"; or "No one prefers black".
How do I tidy this data so that it reshapes/processes the foods and colors into a tidy data frame? For example, could the food and color columns be converted into their own nested data frames? Or is there a better/simpler way to go?
A tidyverse solution is not required but strongly preferred because that's what I am most familiar with. Computational performance is nice but not a strong consideration in this case.
EDIT: For future-proofing, I might also have columns in this data that I do not want to split.
You can use str_split
to create list-columns:
library(dplyr)
library(stringr)
data %>%
mutate(across(-`Full name`, ~ str_split(.x, pattern = "; ")))
# A tibble: 4 × 3
`Full name` `Favorite foods` `Preferred colors`
<chr> <list> <list>
1 Homer <chr [3]> <chr [2]>
2 Marge <chr [2]> <chr [1]>
3 Mr. Burns <chr [4]> <chr [2]>
4 Krusty <chr [3]> <chr [2]>
To make it fully tidy, I'd recommend pivoting:
data %>%
mutate(across(-1, ~ str_split(.x, pattern = "; "))) %>%
pivot_longer(-1) %>%
unnest_longer("value")
# A tibble: 19 × 3
`Full name` name value
<chr> <chr> <chr>
1 Homer Favorite foods key lime pie
2 Homer Favorite foods celery
3 Homer Favorite foods fried rice
4 Homer Preferred colors green
5 Homer Preferred colors red
6 Marge Favorite foods celery
7 Marge Favorite foods ice cream
8 Marge Preferred colors NA
9 Mr. Burns Favorite foods fried rice
10 Mr. Burns Favorite foods apple
11 Mr. Burns Favorite foods fried chicken
12 Mr. Burns Favorite foods ice cream
13 Mr. Burns Preferred colors orange
14 Mr. Burns Preferred colors purple
15 Krusty Favorite foods celery
16 Krusty Favorite foods key lime pie
17 Krusty Favorite foods apple
18 Krusty Preferred colors red
19 Krusty Preferred colors blue