Labelling with "factor" changes variable values


I am currently moving from Stata to R and trying to reproduce in R what I did in Stata, starting from scratch. I imported raw data from Stata and had to drop my labels so that they would not overwrite the variable values. I'm now trying to recreate the labels in R, and to regenerate my dummy variables from multilevel variables.

So I did this:

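library(plyr)   # provides mapvalues()
library(Hmisc)  # assumed source of describe(); psych also has a describe()

newvar <- basevar
# Recode the five-level variable into a 0/1 dummy: 1 stays 1, 2 to 5 become 0
newvar <- mapvalues(newvar, c(1, 2, 3, 4, 5), c(1, 0, 0, 0, 0))

newvar <- factor(newvar,
                 levels = c(0, 1),
                 labels = c("Bad", "Good"))

describe(newvar)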

This worked perfectly, and I got what I expected, a normal describe result with frequency and proportions, correctly labelled.

Then I realized my 0/1 values had been overwritten: 0 had become 1, and 1 had become 2.

Is that a normal part of how labelling works in R? Is there a way to add labels while keeping the original values of the variable?

I'm used to working with 0 and 1 for coding efficiency, and with labels to get readily understandable results (tables and graphs). Stata tends to interpret 1/2 as numerical, which added extra steps to get back to dummy variables, but since I set the variable as a factor in R, I should not have that kind of problem.

Should I learn to work differently with R?


Solution

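  • As far as I know, the first level of a factor is always represented internally by the integer code 1 (the second by 2, and so on), regardless of the original values. That is simply how R stores factors.

    In other functions such as lm(), R treats the first level (code 1, here "Bad") as the reference and creates the dummy variables in the background.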

    Small example:

    set.seed(314)
    newvar <- c(1, 0, 0, 0, 0 )
    outcome <- newvar + rnorm(5)/5 
    
    newvar <- factor(newvar,
                     levels = c(0,1),
                     labels = c("Bad", "Good"))
    
    
    summary(lm(outcome ~ newvar))
    

    Result:

    Call:
    lm(formula = outcome ~ newvar)

    Residuals:
           1        2        3        4        5 
     0.00000  0.17959 -0.13249 -0.10664  0.05954 

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
    (Intercept) -0.03409    0.07344  -0.464   0.6741  
    newvarGood   0.77645    0.16422   4.728   0.0179 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1469 on 3 degrees of freedom
    Multiple R-squared:  0.8817,    Adjusted R-squared:  0.8422 
    F-statistic: 22.36 on 1 and 3 DF,  p-value: 0.01793
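
To see the integer codes directly, and one possible way to get the 0/1 values back, here is a minimal sketch using base R only. The names raw, f, codes, dummy, df and mm are made up for illustration and are not part of the question's data.

```r
# Illustrative 0/1 dummy, same values as in the small example above
raw <- c(1, 0, 0, 0, 0)

f <- factor(raw, levels = c(0, 1), labels = c("Bad", "Good"))

# Internal integer codes start at 1, whatever the original values were
codes <- as.integer(f)

# The labels are stored separately, as the levels of the factor
levels(f)

# Option 1: rebuild the 0/1 dummy from the factor when numbers are needed
dummy <- as.integer(f == "Good")

# Option 2: keep the numeric column and the labelled factor side by side
df <- data.frame(newvar = raw, newvar_f = f)

# The dummy column that modelling functions such as lm() build behind the
# scenes under the default treatment contrasts ("Bad" is the reference)
mm <- model.matrix(~ f)
```

Under these assumptions, the 1/2 codes are only the factor's internal storage: the labelled factor can be used for tables, graphs and models, and a 0/1 column can be kept or rebuilt whenever the numeric dummy itself is needed.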