R and factor coding in formula


How do I use the formula interface if I want custom-valued dummy variables, e.g. values 1 and 2 rather than 0 and 1? The estimation might look like the following, where supp is a factor variable.

fit <- lm(len ~ dose + supp, data = ToothGrowth)

In this example there is little use for different values, but in many cases of a "re-written" model they can be useful.

EDIT: Actually, I have, e.g., 3 levels and want the two dummy columns coded differently: one as a 1/0 variable and the other as a 1/2 variable. The example above has only two levels.


Solution

  • You can set the contrasts to whatever you want: create the matrix you want to use, then either pass it to the contrasts argument of lm or set it as the contrasts of the factor itself.

    Some sample data:

    set.seed(6)
    d <- data.frame(g=gl(3,5,labels=letters[1:3]), x=round(rnorm(15,50,20)))
    

    The contrasts you have in mind:

    mycontrasts <- matrix(c(0,0,1,0,1,1), byrow=TRUE, nrow=3)
    colnames(mycontrasts) <- c("12","23")
    mycontrasts
    #     12 23
    #[1,]  0  0
    #[2,]  1  0
    #[3,]  1  1
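
    To see what this coding does before fitting anything, the factor can be expanded into design-matrix columns with model.matrix. This is only a sketch using the same toy data; the object name X is an assumption made for illustration.

    ```r
    # Same toy data and coding matrix as above.
    set.seed(6)
    d <- data.frame(g = gl(3, 5, labels = letters[1:3]),
                    x = round(rnorm(15, 50, 20)))
    mycontrasts <- matrix(c(0, 0,
                            1, 0,
                            1, 1), byrow = TRUE, nrow = 3)
    colnames(mycontrasts) <- c("12", "23")

    # Expand the factor into design-matrix columns using the custom coding.
    X <- model.matrix(~ g, data = d, contrasts.arg = list(g = mycontrasts))

    # One distinct row per level:
    # a -> (1, 0, 0), b -> (1, 1, 0), c -> (1, 1, 1)
    unique(X)
    ```

    Each row of mycontrasts becomes the dummy values for every observation in the corresponding level, with the intercept column added in front.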
    

    Then you use this in the lm call:

    > lm(x ~ g, data=d, contrasts=list(g=mycontrasts))
    
    Call:
    lm(formula = x ~ g, data = d, contrasts = list(g = mycontrasts))
    
    Coefficients:
    (Intercept)          g12          g23  
           58.8        -13.6          5.8  
    

    We can check that it does the right thing by comparing the coefficients with the differences between successive group means:

    > diff(tapply(d$x, d$g, mean))
        b     c 
    -13.6   5.8 
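
    The same check can be done programmatically instead of by eye. A minimal sketch, assuming the data and matrix defined above; fit and means are names introduced here for illustration.

    ```r
    # Rebuild the toy data and the coding matrix.
    set.seed(6)
    d <- data.frame(g = gl(3, 5, labels = letters[1:3]),
                    x = round(rnorm(15, 50, 20)))
    mycontrasts <- matrix(c(0, 0, 1, 0, 1, 1), byrow = TRUE, nrow = 3)
    colnames(mycontrasts) <- c("12", "23")

    fit   <- lm(x ~ g, data = d, contrasts = list(g = mycontrasts))
    means <- tapply(d$x, d$g, mean)

    # With this coding the intercept is the mean of level a, and the two
    # slopes are the successive differences b - a and c - b.
    all.equal(unname(coef(fit)), unname(c(means[1], diff(means))))
    ```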
    

    The default contrasts (treatment contrasts) use the first level as the baseline, so gc compares level c with level a rather than with level b:

    > lm(x ~ g, data=d)
    
    Call:
    lm(formula = x ~ g, data = d)
    
    Coefficients:
    (Intercept)           gb           gc  
           58.8        -13.6         -7.8  
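
    The default coding itself can be inspected directly. The sketch below assumes R's default options are in effect (contr.treatment for unordered factors); the standalone factor g is created only for illustration.

    ```r
    # A three-level factor like the one in the example data.
    g <- gl(3, 5, labels = letters[1:3])

    # Coding currently attached to the factor. With default options this is
    # the treatment coding: level a is all zeros, b and c get one dummy each.
    contrasts(g)

    # The same 0/1 pattern generated explicitly for three levels.
    contr.treatment(3)

    # The session-wide defaults for unordered and ordered factors.
    getOption("contrasts")
    ```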
    

    That default can be changed for the factor itself with the contrasts<- replacement function, after which a plain lm call picks up the custom coding:

    > contrasts(d$g) <- mycontrasts
    > lm(x ~ g, data=d)
    
    Call:
    lm(formula = x ~ g, data = d)
    
    Coefficients:
    (Intercept)          g12          g23  
           58.8        -13.6          5.8
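
    The same mechanism should cover the mixed coding asked about in the question's EDIT, with one 0/1 column and one 1/2 column. The sketch below is one possible reading of that request, not the only one: the matrix custom and its column names b01 and c12 are hypothetical. The one requirement is that the matrix, together with a column of ones for the intercept, has full rank.

    ```r
    set.seed(6)
    d <- data.frame(g = gl(3, 5, labels = letters[1:3]),
                    x = round(rnorm(15, 50, 20)))

    # Hypothetical coding: b01 is a 0/1 indicator for level b,
    # c12 takes the value 1 for levels a and b and 2 for level c.
    custom <- cbind(b01 = c(0, 1, 0),
                    c12 = c(1, 1, 2))

    # The coding is usable only if cbind(1, custom) has full rank.
    stopifnot(qr(cbind(1, custom))$rank == 3)

    fit_custom  <- lm(x ~ g, data = d, contrasts = list(g = custom))
    fit_default <- lm(x ~ g, data = d)

    # The coefficients differ between the two fits ...
    coef(fit_custom)

    # ... but both describe the same fitted group means.
    all.equal(fitted(fit_custom), fitted(fit_default))
    ```

    Because no level is coded as all zeros here, the intercept is no longer the mean of level a. Changing the coding changes what each coefficient means, not the fit itself.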