Is there a way to force model.matrix.lm
to drop multicollinear variables, as is done in the estimation stage by lm()
?
Here is an example:
library(fixest)
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
summary(fit_lm)
# Call:
# lm(formula = y ~ x1 + x2, data = df)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.82680 -0.41503 0.05499 0.67185 0.97830
#
# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.7494 0.2885 2.598 0.0317 *
# x1 2.3905 0.3157 7.571 6.48e-05 ***
# x2 NA NA NA NA
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.8924 on 8 degrees of freedom
# Multiple R-squared: 0.8775, Adjusted R-squared: 0.8622
# F-statistic: 57.33 on 1 and 8 DF, p-value: 6.476e-05
Note that lm()
drops the collinear variable x2
from the model. But model.matrix()
keeps it:
model.matrix(fit_lm)
# (Intercept) x1 x2
#1 1 1.41175158 1.41175158
#2 1 0.06164133 0.06164133
#3 1 0.09285047 0.09285047
#4 1 -0.63202909 -0.63202909
#5 1 0.25189850 0.25189850
#6 1 -0.18553830 -0.18553830
#7 1 0.65630180 0.65630180
#8 1 -1.77536852 -1.77536852
#9 1 -0.30571009 -0.30571009
#10 1 -1.47296229 -1.47296229
#attr(,"assign")
#[1] 0 1 2
The model.matrix
method from fixest instead allows to drop x2
:
fit_feols <- feols(y ~ x1 + x2, data = df)
model.matrix(fit_feols, type = "rhs", collin.rm = TRUE)
# (Intercept) x1
# [1,] 1 1.41175158
# [2,] 1 0.06164133
# [3,] 1 0.09285047
# [4,] 1 -0.63202909
# [5,] 1 0.25189850
# [6,] 1 -0.18553830
# [7,] 1 0.65630180
# [8,] 1 -1.77536852
# [9,] 1 -0.30571009
# [10,] 1 -1.47296229
Is there a way to drop x2
when calling model.matrix.lm()
?
So long as the overhead from running the linear model is not too high, you could write a little function like the one here to do it:
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
model.matrix2 <- function(model){
bn <- names(na.omit(coef(model)))
X <- model.matrix(model)
X[,colnames(X) %in% bn]
}
model.matrix2(fit_lm)
#> (Intercept) x1
#> 1 1 -0.04654473
#> 2 1 2.14473751
#> 3 1 0.02688125
#> 4 1 0.95071038
#> 5 1 -1.41621259
#> 6 1 1.47840480
#> 7 1 0.56580182
#> 8 1 0.14480401
#> 9 1 -0.02404072
#> 10 1 -0.14393258
Created on 2022-05-02 by the reprex package (v2.0.1)
In the code above, model.matrix2()
is the function that post-processes the model matrix to contain only the variables that have non-missing coefficients in the linear model.