Search code examples
rdplyrtidyverse

How to replace cell values using a variable + value lookup in another dataframe?


I'm working with a large survey data frame where the response to each question is a number. For numeric survey questions, like age, the number is the number. But for multiple choice questions, the number is a code that corresponds with text that's saved in a separate lookup dataframe.

How can I replace all the numbers for each variable with their corresponding label from the lookup dataframe?

Example data:

df_numeric <- 
  tibble::tribble(
    ~gender, ~age, ~city, ~yearly_income, ~fav_colour,  ~over_100_more_vars,
          1,   22,     1,          55000,           1,                "...",
          2,   31,     2,         122000,           2,                "...",
          1,   41,     1,         101000,           2,                "...",
          2,   19,     5,          76000,           1,                "...",
          1,   64,     7,          32000,           6,                "...")
    
df_lookup <- 
  tibble::tribble(
           ~variable, ~number,        ~label,
            "gender",       1,        "Male",
            "gender",       2,      "Female",
              "city",       1,    "New York", 
              "city",       2,      "Sydney",
              "city",       5,      "London",
              "city",       7,       "Paris",
        "fav_colour",       1,         "Red",
        "fav_colour",       2,        "Blue",
        "fav_colour",       6,      "Purple",
   "one_of_100_more",       1,       "Label",
   "one_of_100_more",       2,       "Label",
   "two_of_100_more",       1,       "Label",
               "etc",       1,         "etc")

What I'd ideally like to do is something like: check the variable name in df_numeric, look up that variable in the df_lookup, then for that specific variable, replace each 'number' with its corresponding 'label', then move on to the next variable, replace its numbers with their labels, move on to the next... which should look something like this

df_output <- 
  tibble::tribble(
    ~gender, ~age,      ~city, ~yearly_income, ~fav_colour,  ~over_100_more_vars,
    "Male",   22,  "New York",          55000,       "Red",                "...",
  "Female",   31,    "Sydney",         122000,      "Blue",                "...",
    "Male",   41,  "New York",         101000,      "Blue",                "...",
  "Female",   19,    "London",          76000,       "Red",                "...",
    "Male",   64,     "Paris",          32000,    "Purple",                "...")

Important caveats:

  • There are hundreds of variables, so writing out the name of each variable in the code isn't feasible (e.g. this answer).

  • We only need to replace the character variables like gender, city, etc. No need to replace the values of numeric variables like age and income because these are already in the correct format. These numeric variables already in the correct format aren't in df_lookup.


Solution

  • New edition I would offer this tidyverse solutuon (the current edition contains age processing):

    library(tidyverse) 
    df_numeric %>% 
      mutate(across(-yearly_income, as.character)) %>% 
      pivot_longer(-c("yearly_income", "age") ) %>% 
      left_join(mutate(df_lookup, number = as.character(number)), by = c("name" = "variable", "value" = "number")) %>% 
      select(-value) %>% 
      pivot_wider(id_cols = c("yearly_income", "age"), values_from = label, names_from = name)
    
    # A tibble: 5 x 6
      yearly_income age   gender city     fav_colour over_100_more_vars
              <dbl> <chr> <chr>  <chr>    <chr>      <chr>             
    1         55000 22    Male   New York Red        <NA>              
    2        122000 31    Female Sydney   Blue       <NA>              
    3        101000 41    Male   New York Blue       <NA>              
    4         76000 19    Female London   Red        <NA>              
    5         32000 64    Male   Paris    Purple     <NA>