Search code examples
rregressiondummy-variable

Factor levels default to 1 and 2 in R | Dummy variable


I am transitioning from Stata to R. In Stata, if I label a factor levels (say--0 and 1) to (M and F), 0 and 1 would remain as they are. Moreover, this is required for dummy-variable linear regression in most software including Excel and SPSS.

However, I've noticed that R defaults factor levels to 1,2 instead of 0,1. I don't know why R does this although regression internally (and correctly) assumes 0 and 1 as the factor variable. I would appreciate any help.

Here's what I did:

Try #1:

sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(1,0),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 2 1 2 1 1

It seems that factor levels are now reset to 1 and 2. I believe 1 and 2s are references to the factor level here. However, I have lost the original values i.e. 0s and 1s.

Try2:

sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(0,1),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 1 2 1 2 2

Ditto. My 0's and 1's are now 1's and 2's. Quite Surprising. Why is this happening.

Try3 Now, I wanted to see whether 1s and 2s have any bad effect regression. So, here's what I did:

Here's what my data looks like:

> head(data.frame(sassign$total_,sassign$gender))
  sassign.total_ sassign.gender
1            357              M
2            138              M
3            172              F
4            272              F
5            149              F
6            113              F

myfit<-lm(sassign$total_ ~ sassign$gender)

myfit$coefficients
    (Intercept) sassign$genderM 
      200.63522        23.00606  

So, it turns out that the means are correct. While running the regression, R did use 0 and 1 value as dummies.

I did check other threads on SO, but they mostly talk about how R codes factor variables without telling me why. Stata and SPSS generally require the base variable to be "0." So, I thought of asking about this.

I'd appreciate any thoughts.


Solution

  • In short, you are just mixing up two different concepts. I will clarify them one by one in the following.


    The meaning of integers you see in str()

    What you see from str() is the internal representation of a factor variable. A factor is internally an integer, where the number gives the position of levels inside the vector. For example:

    x <- gl(3, 2, labels = letters[1:3])
    #[1] a a b b c c
    #Levels: a b c
    
    storage.mode(x)  ## or `typeof(x)`
    #[1] "integer"
    
    str(x)
    # Factor w/ 3 levels "a","b","c": 1 1 2 2 3 3
    
    as.integer(x)
    #[1] 1 1 2 2 3 3
    
    levels(x)
    #[1] "a" "b" "c"
    

    A common use of such positions, is to perform as.character(x) in the most efficient way:

    levels(x)[x]
    #[1] "a" "a" "b" "b" "c" "c"
    

    Your misunderstanding of what a model matrix looks like

    It seems to me that you thought a model matrix is obtained by

    cbind(1L, as.integer(x))
    #     [,1] [,2]
    #[1,]    1    1
    #[2,]    1    1
    #[3,]    1    2
    #[4,]    1    2
    #[5,]    1    3
    #[6,]    1    3
    

    which is not true. In this fashion, you are just treating a factor variable as a numerical variable.

    The model matrix is constructed this way:

    xlevels <- levels(x)
    cbind(1L, match(x, xlevels[2], nomatch=0), match(x, xlevels[3], nomatch=0))
    #     [,1] [,2] [,3]
    #[1,]    1    0    0
    #[2,]    1    0    0
    #[3,]    1    1    0
    #[4,]    1    1    0
    #[5,]    1    0    1
    #[6,]    1    0    1
    

    The 1 and 0 implies "match" / "occurrence" and "no-match" / "no-occurrence", respectively.

    The R routine model.matrix will do this for you efficiently, with easy-to-read column names and row names:

    model.matrix(~x)
    #  (Intercept) xb xc
    #1           1  0  0
    #2           1  0  0
    #3           1  1  0
    #4           1  1  0
    #5           1  0  1
    #6           1  0  1
    

    Write an R function to produce a model matrix ourselves

    We could write a nominal routine mm to generate a model matrix. Though it is much less efficient than model.matrix, it may help one digest this concept better.

    mm <- function (x, contrast = TRUE) {
      xlevels <- levels(x)
      lst <- lapply(xlevels, function (z) match(x, z, nomatch = 0L))
      if (contrast) do.call("cbind", c(list(1L), lst[-1]))
      else do.call("cbind", lst)
      }
    

    For example, if we have a factor y with 5 levels:

    set.seed(1); y <- factor(sample(1:5, 10, replace=TRUE), labels = letters[1:5])
    y
    # [1] b b c e b e e d d a
    #Levels: a b c d e
    str(y)
    #Factor w/ 5 levels "a","b","c","d",..: 2 2 3 5 2 5 5 4 4 1
    

    Its model matrix with / without contrast treatment is respectively:

    mm(y, TRUE)
    #      [,1] [,2] [,3] [,4] [,5]
    # [1,]    1    1    0    0    0
    # [2,]    1    1    0    0    0
    # [3,]    1    0    1    0    0
    # [4,]    1    0    0    0    1
    # [5,]    1    1    0    0    0
    # [6,]    1    0    0    0    1
    # [7,]    1    0    0    0    1
    # [8,]    1    0    0    1    0
    # [9,]    1    0    0    1    0
    #[10,]    1    0    0    0    0
    
    mm(y, FALSE)
    #      [,1] [,2] [,3] [,4] [,5]
    # [1,]    0    1    0    0    0
    # [2,]    0    1    0    0    0
    # [3,]    0    0    1    0    0
    # [4,]    0    0    0    0    1
    # [5,]    0    1    0    0    0
    # [6,]    0    0    0    0    1
    # [7,]    0    0    0    0    1
    # [8,]    0    0    0    1    0
    # [9,]    0    0    0    1    0
    #[10,]    1    0    0    0    0
    

    The corresponding model.matrix call will be respectively:

    model.matrix(~ y)
    model.matrix(~ y - 1)