match labels with na_tags


The haven package preserves both value labels and tagged NAs when reading Stata/SPSS files. For example, in the GSS variable for self-employment (wrkslf), the labels suggest there are three different kinds of NA values:

library(tidyverse)
library(haven)

download.file(url="http://gss.norc.org/Documents/stata/2016_stata.zip",
              destfile = "2016_stata.zip")
unzip("2016_stata.zip")

gss <- read_dta("GSS2016.dta")

attr(gss$wrkslf, "labels")
#> self-employed  someone else            DK           IAP            NA 
#>             1             2            NA            NA            NA

Looking at the na_tag() for that variable, we can confirm that there are three types of NA tags:

table(na_tag(gss$wrkslf))
#> 
#>  d  i  n 
#>  4 90  5

My question is: how do we find out which strings in the labels correspond to which of the NA tags? In this example, we can infer that the d, i, and n tags probably correspond to the DK, IAP, and NA labels respectively, just based on their letters (and we could always check the documentation), but I'd like a way to do this programmatically, if possible.

This would be useful if, for example, you wanted to produce a tabulation that displays the values of a variable alongside their associated labels, including the labels for tagged NAs.


Solution

  • Looking at the definition of print_labels(), I see that NA tags and labels are associated like this:

    format_tagged_na(attr(gss$wrkslf, "labels"))
    #> self-employed  someone else            DK           IAP            NA 
    #>       "    1"       "    2"       "NA(d)"       "NA(i)"       "NA(n)"