Search code examples
rcategorical-dataanova

Why Do User Defined Contrasts in R need to be provided as an INVERSE matrix of weights?


I would like to use some user defined contrasts in R in addition to the default contrast codes (contr.treatment / contr.sum / contr.helmert). However, the guidance I have read indicates that these need to be provided in an inverse matrix. Could anybody explain why?

Ie, the guidance here: https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/ states:

The final contrast matrix (or coding scheme) turns out to be the inverse of mat transposed.

Similarly, this site: https://rstudio-pubs-static.s3.amazonaws.com/65059_586f394d8eb84f84b1baaf56ffb6b47f.html writes:

While there are a handful of automatic contrast functions in R (what I’ve been using so far), you will sometimes find yourself wanting to run comparisons that are not included there. When that happens, you can specify them yourself. You need to be careful, though, because the contrasts() function is a sneaky little bastard, as noted above. To apply contrast weights, you’ll need to give it the inverse of your matrix of weights.

Neither explains why. In addition, computing an inverse matrix messes up the contrast coefficients so that they become difficult to interpret as they are no longer a standard unit apart, so I'd like to know why it's necessary.


Solution

  • Let me try and see what happens on a simple example.

    We start with the data:

    x = sample(letters[1:3], size = 300, replace = T)
    y = rnorm(300)
    head(x, 10)
     [1] "c" "c" "c" "c" "a" "a" "a" "c" "b" "a"
    

    Let's say your custom contrasts are:

    contr_mat = matrix(c(1, -0.5, -0.5,
                         1, 0, -1),
                       nrow = 2,  byrow = TRUE)
    contr_mat
         [,1] [,2] [,3]
    [1,]    1 -0.5 -0.5
    [2,]    1  0.0 -1.0
    

    So you are comparing for example the mean of y in group a to the mean of y in group c (second contrast).

    Let's build a contrast coding (weight) matrix using the pseudoinverse of the contrast matrix using MASS::ginv(). We also add an intercept term:

    C1 <- cbind(1, ginv(contr_mat))
    round(C1, 3)
         [,1]   [,2] [,3]
    [1,]    1  0.667    0
    [2,]    1 -1.333    1
    [3,]    1  0.667   -1
    

    This matrix is square and full rank, and by inverting it we get the contrasts back, with an intercept term:

    C <- solve(C1)
    round(C, 2)
         [,1]  [,2]  [,3]
    [1,] 0.33  0.33  0.33
    [2,] 1.00 -0.50 -0.50
    [3,] 1.00  0.00 -1.00
    

    Because C and C1 are inverses, you can include them in the model supporting your comparisons:

    y = Xu + e = XIu + e= X(C1C)u + e = (XC1)(Cu) + e

    where X the design matrix (for this cell means model), u is the vector of means, I the identity matrix, and e the vector of random component with normal distribution. (Cu) represents the contrasts you want.

    Now we can use the least square equations to evaluate the contrasts Cu, using the modified design matrix (XC1). So the inverse of the contrast matrix is used to evaluate the actual contrasts Cu:

    X <- model.matrix(~x + 0)    # model matrix for the cell means model
    X1 <- X %*% C1               # this is the modified model matrix with the inverse
    

    Now we solve by the method of least squares or by lm(). lm() takes the inverse of your contrast matrix as argument and adds the intercept term, then evaluates the contrasts by the least squares method:

    # least-squares equation:
    
    solve ( t(X1)  %*% X1 ) %*% t(X1) %*% y
                [,1]
    [1,] -0.06524626
    [2,]  0.03627632
    [3,]  0.04820965
    
    # lm() with modified design matrix
    summary(lm(y ~ X1 +0 ) )
    
    Call:
    lm(formula = y ~ X1 + 0)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -3.1953 -0.7073  0.0250  0.7382  2.7251 
    
    Coefficients:
        Estimate Std. Error t value Pr(>|t|)
    X11 -0.06525    0.06049  -1.079    0.282
    X12  0.03628    0.12596   0.288    0.774
    X13  0.04821    0.14336   0.336    0.737
    
    
    # lm() with your contrasts
    summary( lm(y ~ x, contrasts = list(x = ginv(contr_mat))) )
    
    Call:
    lm(formula = y ~ x, contrasts = list(x = ginv(contr_mat)))
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -3.1953 -0.7073  0.0250  0.7382  2.7251 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.06525    0.06049  -1.079    0.282
    x1           0.03628    0.12596   0.288    0.774
    x2           0.04821    0.14336   0.336    0.737
    

    The last two rows contain the comparisons you want:

    means <- tapply(X = y, INDEX = factor(x), FUN = mean)
    C %*% means
                [,1]
    [1,] -0.06524626
    [2,]  0.03627632
    [3,]  0.04820965
    
    (means[1] + means[2] + means[3])/3
    -0.06524626 
    means[1] - (means[2] + means[3])/2
    0.03627632 
    means[1] - means[3]
    0.04820965
    

    I made a similar response with better formatting for the matrix terms here:

    https://stats.stackexchange.com/questions/535123/custom-contrasts-in-r-why-should-i-take-the-generalized-inverse-of-transpose-of/612043#612043

    A very good reference is:

    https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf