The haven package preserves both value labels and tagged NAs when reading Stata/SPSS files. For example, in the GSS's variable for self-employment, the labels suggest there are three different kinds of NA values:
library(tidyverse)
library(haven)
download.file(url = "http://gss.norc.org/Documents/stata/2016_stata.zip",
              destfile = "2016_stata.zip")
unzip("2016_stata.zip")
gss <- read_dta("GSS2016.dta")
attr(gss$wrkslf, "labels")
#> self-employed  someone else            DK           IAP            NA
#>             1             2            NA            NA            NA
Looking at na_tag() for that variable, we can confirm that there are three types of NA tags:
table(na_tag(gss$wrkslf))
#>
#>  d  i  n
#>  4 90  5
My question is: how do we find out which strings in the labels correspond to which of the NA tags? In this example, we can infer that the d, i, and n tags probably correspond to the DK, IAP, and NA labels respectively, just based on their letters (and we could always check the documentation), but I'd like a way to do this programmatically, if possible.
This would be useful if, for example, you wanted to produce a tabulation of a particular variable which displays the values of a variable alongside their associated labels, including for tagged NAs.
Looking at the definition of print_labels, I see that NA tags and labels are associated like this:
format_tagged_na(attr(gss$wrkslf, "labels"))
#> self-employed  someone else            DK           IAP            NA
#>          " 1"          " 2"       "NA(d)"       "NA(i)"       "NA(n)"