Changing internal factor levels in R (important for haven - write_dta())


The haven package provides write_dta(), a very useful function for exporting a data frame or tibble to Stata.

When an R factor is written to Stata with write_dta(), R's internal integer codes become the numeric values stored in Stata (as type long), and the factor levels are written as the value labels. (These internal codes are what as.numeric(factor) returns.)

I want to explicitly set R's internal factor codes so that I get the desired values for numlabels in Stata.

To illustrate:

eyes <- c("blue", "brown","green", "blue", "not disclose") 
eyes_factor <- as.factor(eyes)

eyes_factor  # print the factor: values first, then its levels
 #[1] blue         brown        green        blue         not disclose
 #Levels: blue brown green not disclose

as.numeric(as.factor(eyes)) 
#[1] 1 2 3 1 4 # which is to be expected

However, I want to set R's internal factor codes according to a specific pattern that matches the coding on a questionnaire:

blue = 2, brown = 1, green = 6, not disclose = -1

I have tried using lvls_reorder() from the forcats package. The function looks like this:

forcats::lvls_reorder
function (f, idx, ordered = NA) 
{
    f <- check_factor(f)
    if (!is.numeric(idx)) {
        stop("`idx` must be numeric", call. = FALSE)
    }
    if (!setequal(idx, lvls_seq(f)) || length(idx) != nlevels(f)) {
        stop("`idx` must contain one integer for each level of `f`", 
            call. = FALSE)
    }
    refactor(f, levels(f)[idx], ordered = ordered)
}

But as you can see, I cannot supply the idx I need, because idx must be a permutation of the sequential level positions 1 to nlevels(f).

stats::relevel() did not solve the problem either.


Solution

  • If it weren't for the -1 = not disclose requirement, you could do this simply with something like:

    eyes2 <- factor(eyes, 
               levels = c("brown", "blue", paste0("not_used_", 1:3), "green", "not disclose"))
    

    That would be exactly what you want, except that not disclose is 7 rather than -1. One option is to do it this way and then recode the 7 to -1 in Stata. A variant is to force the not disclose values to be NA (e.g. by not including "not disclose" as a valid level); I'm not sure how that comes through in Stata.
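
    To check what the padded-levels trick and its NA variant produce, here is a minimal base R sketch. The name eyes_na is only illustrative; the levels are the ones used above.

    ```r
    eyes <- c("blue", "brown", "green", "blue", "not disclose")

    # Placeholder levels push "green" to position 6
    eyes2 <- factor(eyes,
                    levels = c("brown", "blue", paste0("not_used_", 1:3),
                               "green", "not disclose"))
    as.integer(eyes2)
    #> [1] 2 1 6 2 7

    # Variant: leave "not disclose" out of the levels so it becomes NA
    eyes_na <- factor(eyes,
                      levels = c("brown", "blue", paste0("not_used_", 1:3),
                                 "green"))
    as.integer(eyes_na)
    #> [1]  2  1  6  2 NA
    ```

    The three not_used_ levels stay attached to the factor even though no observation uses them.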

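    R factors can't have -1 as one of the underlying codes: the codes are always the positive integers 1 to nlevels, indexing into the levels. So I don't think there's any simple way to get around this with a factor. You'll have to recode the values yourself, using a lookup table. For example: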

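    library(dplyr)

    # Lookup table mapping each answer to its questionnaire code
    eye_codes <- data.frame(code = c(-1, 1, 2, 6),
                            level = c("not disclose", "brown", "blue", "green"),
                            stringsAsFactors = FALSE)

    # Keep eyes as character so both join keys have the same type
    eyes3 <- left_join(data.frame(eyes, stringsAsFactors = FALSE),
                       eye_codes, by = c("eyes" = "level"))

    eyes3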
    

    Which gets you:

              eyes code
    1         blue    2
    2        brown    1
    3        green    6
    4         blue    2
    5 not disclose   -1
    

    The code column is what you want here. Note that I used dplyr::left_join() rather than merge() because left_join() keeps the rows in the order of the left-hand table, whereas merge() sorts the result by the join key by default.

    This is a bit of a pain, of course. Personally, I'd save the data out of R as platform-agnostic character text (not factors at all, which seem to carry too many risks), then, if you need them explicitly coded in a particular way in Stata, do that recoding in Stata.
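
    Another route that may be worth trying, since the export goes through haven anyway, is to skip factors and attach value labels to the numeric codes with haven::labelled(). The sketch below is only that: the names codes, eye_code, eyes_lab and out_path are illustrative, it assumes haven (and tibble, which haven imports) are installed, and I have not checked how the result displays in Stata.

    ```r
    eyes <- c("blue", "brown", "green", "blue", "not disclose")

    # Questionnaire codes keyed by answer text (values from the question)
    codes <- c("not disclose" = -1L, "brown" = 1L, "blue" = 2L, "green" = 6L)

    # Named-vector lookup; unname() drops the names copied from `codes`
    eye_code <- unname(codes[eyes])
    eye_code
    #> [1]  2  1  6  2 -1

    # Assumption: haven and tibble are available; skipped otherwise
    if (requireNamespace("haven", quietly = TRUE)) {
      # Attach the answer texts as value labels on the integer codes
      eyes_lab <- haven::labelled(eye_code, labels = codes)
      out_path <- tempfile(fileext = ".dta")  # hypothetical output location
      haven::write_dta(tibble::tibble(eyes = eyes_lab), out_path)
    }
    ```

    The named-vector lookup does the same job as the left_join() above without needing dplyr, and because the codes are plain integers, negative values such as -1 are not a problem on the R side.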