How to label values from a data dictionary in R

I have two files, one of which being a .csv file and the other an Excel document with a data dictionary in it. I want to label the values of the .csv file with the values from the data dictionary, so that the data frame looks like one loaded from a .dta/.sav file (in that the cells don't contain the values as a character/factor, but a labelled value). Is there a way of assigning value labels to cells from a data dictionary?"

Dummy data are below. I did work out how to do it using match_df from {matchmaker}, but this put all the values in as character rather than labelling them.

# where dd is the data dictionary and df the data frame
dd <- data.frame(variable.name = c(rep("sex", 2), 
                                   rep("age", 1), 
                                   rep("school", 3), 
                                   rep("parent", 2)), 
                 value = c(1, 2, 
                           -96, 
                           1, 2, -96, 
                           1, 2), 
                 variable.label = c("male", "female", 
                                    NA, 
                                    "private", "state", "missing", 
                                    "mother", "father"))


df <- data.frame(id = 1:10, 
                 sex = rep(c(1, 2), 5), 
                 age = c(rep(5, 9), -96), 
                 school = as.integer(c(rep(c(1, 2, 3), 3), 1)),
                 parent = c(rep(1, 2), rep(2, 8)))

# the following works to an extent, but it's not the desired outcome:
match_df(df, 
         dictionary = dd, 
         from = "value", 
         to = "variable.label",
         by = "variable.name")

# desired outcome would look something like this:
 dput()
structure(list(id = 1:10, sex = structure(c(1, 2, 1, 2, 1, 2, 
1, 2, 1, 2), labels = c(male = 1, female = 2), class = c("haven_labelled", 
"vctrs_vctr", "double")), age = structure(c(5, 5, 5, 5, 5, 5, 
5, 5, 5, -96), labels = c(missing = -96), class = c("haven_labelled", 
"vctrs_vctr", "double")), school = structure(c(1L, 2L, 3L, 1L, 
2L, 3L, 1L, 2L, 3L, 1L), labels = structure(1:3, names = c("private", 
"state", "missing")), class = c("haven_labelled", "vctrs_vctr", 
"integer")), parent = structure(c(1, 1, 2, 2, 2, 2, 2, 2, 2, 
2), labels = c(mother = 1, father = 2), class = c("haven_labelled", 
"vctrs_vctr", "double"))), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

Solution

I think the first thing to do is split(dd) into a list of data frames, each of which contains the appropriate factor levels for one column.

levels_list <- split(dd, ~variable.name)

# This looks like:
# $age
#   variable.name value variable.label
# 3           age   -96           <NA>

# $parent
#   variable.name value variable.label
# 7        parent     1         mother
# 8        parent     2         father
# <etc>

You can then write a function to construct the haven_labelled class which takes one of these data frames and df and returns a labelled vector.

create_haven_labelled <- function(col, dat = df, ldf = levels_list[[col]]) {
    levels <- unique(dat[[col]])
    labels <- ldf[, "variable.label"]

    if (all(is.na(labels))) {
        # Assume no labels if none except NA provided
        levels <- ldf[, "value"]
        labels <- "missing"
    } else { 
        # Add "missing" to labels not found
        labels <- labels[match(levels, ldf$value)]
        labels[is.na(labels)] <- "missing"
    }

    dat[[col]] |>
        haven::labelled() |>
        `attr<-`("labels", setNames(levels, labels))
}

This function labels any values not found as "missing", except in the case where there are no labels other than missing values provided, e.g. "age", where it assumes that no labels except "missing" are required.

We can then apply this to the relevant columns:

out_df <- dplyr::as_tibble(df) # just so it prints nicely
out_df[names(levels_list)] <- lapply(names(levels_list), create_haven_labelled)

out_df
# # A tibble: 10 × 5
#       id sex        age           school      parent    
#    <int> <dbl+lbl>  <dbl+lbl>     <int+lbl>   <dbl+lbl> 
#  1     1 1 [male]     5           1 [private] 1 [mother]
#  2     2 2 [female]   5           2 [state]   1 [mother]
#  3     3 1 [male]     5           3 [missing] 2 [father]
#  4     4 2 [female]   5           1 [private] 2 [father]
#  5     5 1 [male]     5           2 [state]   2 [father]
#  6     6 2 [female]   5           3 [missing] 2 [father]
#  7     7 1 [male]     5           1 [private] 2 [father]
#  8     8 2 [female]   5           2 [state]   2 [father]
#  9     9 1 [male]     5           3 [missing] 2 [father]
# 10    10 2 [female] -96 [missing] 1 [private] 2 [father]

identical(out_df, desired) # TRUE