I have a large data frame (hundreds of variables wide) in which all values of categorical variables are saved as numerics, for example, 1, 2, 8, representing no, yes, and unknown. However, this is not always consistent. There are variables that have ten or more categories with 88 representing unknown etc.
data <- data.frame("ID" = c(1:5),
"Var1" = c(2,2,8,1,8),
"Var2" = c(5,8,4,88,10))
For each variable, I do have all information on which value represents which category. Currently, I have this information stored in vectors that are each correctly ordered, like
> Var1_values
[1] 8 2 1
with a corresponding vector containing the categories:
> Var1_categories
[1] "unknown" "yes" "no"
But I cannot figure out a process for how to bring this information together in order to automate the recoding process towards an expected result like
| ID | Var1 | Var2 |
|----|---------|-------------------|
| 1 | yes | condition E |
| 2 | yes | condition H |
| 3 | unknown | condition D |
| 4 | no | unknown condition |
| 5 | unknown | condition H |
where each column is a meaningful factor variable.
As I said, the data frame is very wide and things might change internally, so doing this manually is not an option. I feel like I'm being stupid as I have all the necessary information readily available, so any insight would be greatly appreciated, and a cup of coffee is the least I can do for helpful advice.
// edit:
I forgot to mention that I have already made some kind of a mapping-dataframe but I couldn't really put it to use, yet. It looks like this:
mapping <- data.frame("Variable" = c("Var1", "Var2", "Var3", "Var4"),
"Value1" = c(2,2,2,7),
"Word1" = c("yes","yes","yes","condition A"),
"Value2" = c(1,1,1,6),
"Word2" = c("no","no","no","Condition B"),
"Value3" = c(8,8,8,5),
"Word3" = c("unk","unk","unk", "Condition C"),
"Value4" = c(NA,NA,NA,4),
"Word4" = c(NA,NA,NA,"Condition B")
)
I would like to "long"-transform it so I can use it with @r2evan 's solution.
Here's one thought, though it requires reshaping (twice) the data.
mapping <- data.frame(
Var = c(rep("Var1", 3), rep("Var2", 5)),
Val = c(1, 2, 8, 4, 5, 8, 10, 88),
Words = c("no", "yes", "unk", "D", "E", "H", "H", "unk")
)
mapping
# Var Val Words
# 1 Var1 1 no
# 2 Var1 2 yes
# 3 Var1 8 unk
# 4 Var2 4 D
# 5 Var2 5 E
# 6 Var2 8 H
# 7 Var2 10 H
# 8 Var2 88 unk
library(dplyr)
library(tidyr) # pivot_*
data %>%
pivot_longer(-ID, names_to = "Var", values_to = "Val") %>%
left_join(mapping, by = c("Var", "Val")) %>%
pivot_wider(ID, names_from = "Var", values_from = "Words")
# # A tibble: 5 x 3
# ID Var1 Var2
# <int> <chr> <chr>
# 1 1 yes E
# 2 2 yes H
# 3 3 unk D
# 4 4 no unk
# 5 5 unk H
With this method, you control the number-to-words mapping for each variable.
Another option is to use a map list, similar to above but it does not require double-reshaping.
maplist <- list(
Var1 = c("1" = "no", "2" = "yes", "8" = "unk"),
Var2 = c("4" = "D", "5" = "E", "8" = "H", "10" = "H", "88" = "unk")
)
maplist
# $Var1
# 1 2 8
# "no" "yes" "unk"
# $Var2
# 4 5 8 10 88
# "D" "E" "H" "H" "unk"
nms <- c("Var1", "Var2")
data[,nms] <- Map(function(val, lookup) lookup[as.character(val)],
data[nms], maplist[nms])
data
# ID Var1 Var2
# 1 1 yes E
# 2 2 yes H
# 3 3 unk D
# 4 4 no unk
# 5 5 unk H
Between the two, I think I prefer the first if your data doesn't punish you for reshaping it (many things could make this less appealing). One reason it's good is that maintaining the mapping
can be as easy as maintaining a CSV (which might be done in your favorite spreadsheet tool, e.g., Excel or Calc).