I'm working with a large survey data frame where the response to each question is a number. For numeric survey questions, like age, the number is the number. But for multiple choice questions, the number is a code that corresponds with text that's saved in a separate lookup dataframe.
How can I replace all the numbers for each variable with their corresponding label from the lookup dataframe?
Example data:
df_numeric <-
tibble::tribble(
~gender, ~age, ~city, ~yearly_income, ~fav_colour, ~over_100_more_vars,
1, 22, 1, 55000, 1, "...",
2, 31, 2, 122000, 2, "...",
1, 41, 1, 101000, 2, "...",
2, 19, 5, 76000, 1, "...",
1, 64, 7, 32000, 6, "...")
df_lookup <-
tibble::tribble(
~variable, ~number, ~label,
"gender", 1, "Male",
"gender", 2, "Female",
"city", 1, "New York",
"city", 2, "Sydney",
"city", 5, "London",
"city", 7, "Paris",
"fav_colour", 1, "Red",
"fav_colour", 2, "Blue",
"fav_colour", 6, "Purple",
"one_of_100_more", 1, "Label",
"one_of_100_more", 2, "Label",
"two_of_100_more", 1, "Label",
"etc", 1, "etc")
What I'd ideally like to do is something like: check the variable name in df_numeric, look up that variable in the df_lookup, then for that specific variable, replace each 'number' with its corresponding 'label', then move on to the next variable, replace its numbers with their labels, move on to the next... which should look something like this
df_output <-
tibble::tribble(
~gender, ~age, ~city, ~yearly_income, ~fav_colour, ~over_100_more_vars,
"Male", 22, "New York", 55000, "Red", "...",
"Female", 31, "Sydney", 122000, "Blue", "...",
"Male", 41, "New York", 101000, "Blue", "...",
"Female", 19, "London", 76000, "Red", "...",
"Male", 64, "Paris", 32000, "Purple", "...")
Important caveats:
There are hundreds of variables, so writing out the name of each variable in the code isn't feasible (e.g. this answer).
We only need to replace the character variables like gender, city, etc. No need to replace the values of numeric variables like age and income because these are already in the correct format. These numeric variables already in the correct format aren't in df_lookup.
New edition
I would offer this tidyverse
solutuon (the current edition contains age processing):
library(tidyverse)
df_numeric %>%
mutate(across(-yearly_income, as.character)) %>%
pivot_longer(-c("yearly_income", "age") ) %>%
left_join(mutate(df_lookup, number = as.character(number)), by = c("name" = "variable", "value" = "number")) %>%
select(-value) %>%
pivot_wider(id_cols = c("yearly_income", "age"), values_from = label, names_from = name)
# A tibble: 5 x 6
yearly_income age gender city fav_colour over_100_more_vars
<dbl> <chr> <chr> <chr> <chr> <chr>
1 55000 22 Male New York Red <NA>
2 122000 31 Female Sydney Blue <NA>
3 101000 41 Male New York Blue <NA>
4 76000 19 Female London Red <NA>
5 32000 64 Male Paris Purple <NA>