Search code examples
rlmmodel.matrixfixest

R model.matrix drop multicollinear variables


Is there a way to force model.matrix.lm to drop multicollinear variables, as is done in the estimation stage by lm()?

Here is an example:

library(fixest)

N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)

df <- data.frame(y = y, x1 = x1, x2 = x2)

fit_lm <- lm(y ~ x1 + x2, data = df)
summary(fit_lm)
# Call:
#   lm(formula = y ~ x1 + x2, data = df)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -1.82680 -0.41503  0.05499  0.67185  0.97830 
# 
# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.7494     0.2885   2.598   0.0317 *  
#   x1            2.3905     0.3157   7.571 6.48e-05 ***
#   x2                NA         NA      NA       NA    
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.8924 on 8 degrees of freedom
# Multiple R-squared:  0.8775,  Adjusted R-squared:  0.8622 
# F-statistic: 57.33 on 1 and 8 DF,  p-value: 6.476e-05

Note that lm() drops the collinear variable x2 from the model. But model.matrix() keeps it:

 model.matrix(fit_lm)
#   (Intercept)          x1          x2
#1            1  1.41175158  1.41175158
#2            1  0.06164133  0.06164133
#3            1  0.09285047  0.09285047
#4            1 -0.63202909 -0.63202909
#5            1  0.25189850  0.25189850
#6            1 -0.18553830 -0.18553830
#7            1  0.65630180  0.65630180
#8            1 -1.77536852 -1.77536852
#9            1 -0.30571009 -0.30571009
#10           1 -1.47296229 -1.47296229
#attr(,"assign")
#[1] 0 1 2

The model.matrix method from fixest instead allows to drop x2:

fit_feols <- feols(y ~ x1 + x2, data = df)
model.matrix(fit_feols, type = "rhs", collin.rm = TRUE)
# (Intercept)          x1
# [1,]           1  1.41175158
# [2,]           1  0.06164133
# [3,]           1  0.09285047
# [4,]           1 -0.63202909
# [5,]           1  0.25189850
# [6,]           1 -0.18553830
# [7,]           1  0.65630180
# [8,]           1 -1.77536852
# [9,]           1 -0.30571009
# [10,]           1 -1.47296229

Is there a way to drop x2 when calling model.matrix.lm()?


Solution

  • So long as the overhead from running the linear model is not too high, you could write a little function like the one here to do it:

    N <- 10
    x1 <- rnorm(N)
    x2 <- x1
    y <- 1 + x1 + x2 + rnorm(N)
    
    df <- data.frame(y = y, x1 = x1, x2 = x2)
    
    fit_lm <- lm(y ~ x1 + x2, data = df)
    
    model.matrix2 <- function(model){
      bn <- names(na.omit(coef(model)))
      X <- model.matrix(model)
      X[,colnames(X) %in% bn]
    }
    
    model.matrix2(fit_lm)
    #>    (Intercept)          x1
    #> 1            1 -0.04654473
    #> 2            1  2.14473751
    #> 3            1  0.02688125
    #> 4            1  0.95071038
    #> 5            1 -1.41621259
    #> 6            1  1.47840480
    #> 7            1  0.56580182
    #> 8            1  0.14480401
    #> 9            1 -0.02404072
    #> 10           1 -0.14393258
    

    Created on 2022-05-02 by the reprex package (v2.0.1)

    In the code above, model.matrix2() is the function that post-processes the model matrix to contain only the variables that have non-missing coefficients in the linear model.