I have two files, one of which being a .csv file and the other an Excel document with a data dictionary in it. I want to label the values of the .csv file with the values from the data dictionary, so that the data frame looks like one loaded from a .dta/.sav file (in that the cells don't contain the values as a character/factor, but a labelled value). Is there a way of assigning value labels to cells from a data dictionary?"
Dummy data are below. I did work out how to do it using match_df
from {matchmaker}
, but this put all the values in as character rather than labelling them.
# where dd is the data dictionary and df the data frame
dd <- data.frame(variable.name = c(rep("sex", 2),
rep("age", 1),
rep("school", 3),
rep("parent", 2)),
value = c(1, 2,
-96,
1, 2, -96,
1, 2),
variable.label = c("male", "female",
NA,
"private", "state", "missing",
"mother", "father"))
df <- data.frame(id = 1:10,
sex = rep(c(1, 2), 5),
age = c(rep(5, 9), -96),
school = as.integer(c(rep(c(1, 2, 3), 3), 1)),
parent = c(rep(1, 2), rep(2, 8)))
# the following works to an extent, but it's not the desired outcome:
match_df(df,
dictionary = dd,
from = "value",
to = "variable.label",
by = "variable.name")
# desired outcome would look something like this:
dput()
structure(list(id = 1:10, sex = structure(c(1, 2, 1, 2, 1, 2,
1, 2, 1, 2), labels = c(male = 1, female = 2), class = c("haven_labelled",
"vctrs_vctr", "double")), age = structure(c(5, 5, 5, 5, 5, 5,
5, 5, 5, -96), labels = c(missing = -96), class = c("haven_labelled",
"vctrs_vctr", "double")), school = structure(c(1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L), labels = structure(1:3, names = c("private",
"state", "missing")), class = c("haven_labelled", "vctrs_vctr",
"integer")), parent = structure(c(1, 1, 2, 2, 2, 2, 2, 2, 2,
2), labels = c(mother = 1, father = 2), class = c("haven_labelled",
"vctrs_vctr", "double"))), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
I think the first thing to do is split(dd)
into a list of data frames, each of which contains the appropriate factor levels for one column.
levels_list <- split(dd, ~variable.name)
# This looks like:
# $age
# variable.name value variable.label
# 3 age -96 <NA>
# $parent
# variable.name value variable.label
# 7 parent 1 mother
# 8 parent 2 father
# <etc>
You can then write a function to construct the haven_labelled
class which takes one of these data frames and df
and returns a labelled vector.
create_haven_labelled <- function(col, dat = df, ldf = levels_list[[col]]) {
levels <- unique(dat[[col]])
labels <- ldf[, "variable.label"]
if (all(is.na(labels))) {
# Assume no labels if none except NA provided
levels <- ldf[, "value"]
labels <- "missing"
} else {
# Add "missing" to labels not found
labels <- labels[match(levels, ldf$value)]
labels[is.na(labels)] <- "missing"
}
dat[[col]] |>
haven::labelled() |>
`attr<-`("labels", setNames(levels, labels))
}
This function labels any values not found as "missing"
, except in the case where there are no labels other than missing values provided, e.g. "age"
, where it assumes that no labels except "missing"
are required.
We can then apply this to the relevant columns:
out_df <- dplyr::as_tibble(df) # just so it prints nicely
out_df[names(levels_list)] <- lapply(names(levels_list), create_haven_labelled)
out_df
# # A tibble: 10 × 5
# id sex age school parent
# <int> <dbl+lbl> <dbl+lbl> <int+lbl> <dbl+lbl>
# 1 1 1 [male] 5 1 [private] 1 [mother]
# 2 2 2 [female] 5 2 [state] 1 [mother]
# 3 3 1 [male] 5 3 [missing] 2 [father]
# 4 4 2 [female] 5 1 [private] 2 [father]
# 5 5 1 [male] 5 2 [state] 2 [father]
# 6 6 2 [female] 5 3 [missing] 2 [father]
# 7 7 1 [male] 5 1 [private] 2 [father]
# 8 8 2 [female] 5 2 [state] 2 [father]
# 9 9 1 [male] 5 3 [missing] 2 [father]
# 10 10 2 [female] -96 [missing] 1 [private] 2 [father]
identical(out_df, desired) # TRUE