I have seen similar posts like this one which say that the message:

Coefficients: (1 not defined because of singularities)

is caused by nearly perfect correlation among the predictors used in the lm() call.
But in my case there is no nearly perfect correlation among the predictors, yet one coefficient (X_wthn_outcome) still comes back as NA in the lm() output.
What is wrong with that coefficient?
For reproducibility purposes, the exact same data and code are provided below.
library(dplyr)
set.seed(132)
(data <- expand.grid(study = 1:1e3, outcome = rep(1:50,2)))
data$X <- rnorm(nrow(data))
e <- rnorm(nrow(data), 0, 2)
data$yi <- .8 +.6*data$X + e
dat <- data %>%
  group_by(study) %>%
  mutate(X_btw_study = mean(X), X_wthn_study = X - X_btw_study) %>%
  group_by(outcome, .add = TRUE) %>%
  mutate(X_btw_outcome = mean(X), X_wthn_outcome = X - X_btw_outcome) %>%
  ungroup()
round(cor(select(dat,-study,-outcome,-X,-yi)),3)
# X_btw_study X_wthn_study X_btw_outcome X_wthn_outcome
#X_btw_study 1.000 0.000 0.141 0.00
#X_wthn_study 0.000 1.000 0.698 0.71
#X_btw_outcome 0.141 0.698 1.000 0.00
#X_wthn_outcome 0.000 0.710 0.000 1.00
summary(lm(yi ~ 0 + X_btw_study + X_btw_outcome + X_wthn_study
+ X_wthn_outcome, data = dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#X_btw_study 0.524093 0.069610 7.529 5.15e-14 ***
#X_btw_outcome 0.014557 0.013694 1.063 0.288
#X_wthn_study 0.589517 0.009649 61.096 < 2e-16 ***
#X_wthn_outcome NA NA NA NA ## What's wrong with this variable
You have constructed a problem where the three-way combination of X_btw_study + X_btw_outcome + X_wthn_study perfectly predicts X_wthn_outcome:
lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat)
#------------------
Call:
lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study,
data = dat)
Coefficients:
(Intercept) X_btw_study X_btw_outcome X_wthn_study
1.165e-17 1.000e+00 -1.000e+00 1.000e+00
#--------------
summary( lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat) )
#---------------
Call:
lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study,
data = dat)
Residuals:
Min 1Q Median 3Q Max
-3.901e-14 -6.000e-17 0.000e+00 5.000e-17 3.195e-13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.165e-17 3.242e-18 3.594e+00 0.000326 ***
X_btw_study 1.000e+00 3.312e-17 3.020e+16 < 2e-16 ***
X_btw_outcome -1.000e+00 6.515e-18 -1.535e+17 < 2e-16 ***
X_wthn_study 1.000e+00 4.590e-18 2.178e+17 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.025e-15 on 99996 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.582e+34 on 3 and 99996 DF, p-value: < 2.2e-16
You have an adjusted R^2 of 1 with three predictors. So this is MULTI-collinearity, not two-way collinearity. (R has caught on to your tricks and won't let you get away with this dplyr game of "hide the dependencies".) I think the dependencies might have been more apparent if you had built the variables in a sequence of separate steps rather than in a piped chain.
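The 1, -1, 1 coefficients above are no accident. By construction X = X_btw_study + X_wthn_study and X = X_btw_outcome + X_wthn_outcome, so X_wthn_outcome = X_btw_study + X_wthn_study - X_btw_outcome holds exactly, and lm() drops the last column of the dependent set. A standalone sketch of the check (rebuilding dat as in the question; alias() is base R's tool for listing such dependencies):

```r
library(dplyr)

# Rebuild the data exactly as in the question so this runs standalone
set.seed(132)
data <- expand.grid(study = 1:1e3, outcome = rep(1:50, 2))
data$X <- rnorm(nrow(data))
e <- rnorm(nrow(data), 0, 2)
data$yi <- .8 + .6 * data$X + e

dat <- data %>%
  group_by(study) %>%
  mutate(X_btw_study = mean(X), X_wthn_study = X - X_btw_study) %>%
  group_by(outcome, .add = TRUE) %>%
  mutate(X_btw_outcome = mean(X), X_wthn_outcome = X - X_btw_outcome) %>%
  ungroup()

# Since X = X_btw_study + X_wthn_study and X = X_btw_outcome + X_wthn_outcome,
# the fourth centered variable is an exact linear combination of the other three:
#   X_wthn_outcome = X - X_btw_outcome = X_btw_study + X_wthn_study - X_btw_outcome
all.equal(dat$X_wthn_outcome,
          dat$X_btw_study + dat$X_wthn_study - dat$X_btw_outcome)
#> TRUE

# alias() reports the dependency directly from the singular fit,
# showing X_wthn_outcome aliased to the other three predictors
fit <- lm(yi ~ 0 + X_btw_study + X_btw_outcome + X_wthn_study + X_wthn_outcome,
          data = dat)
alias(fit)
```

Had the variables been built in separate steps, writing out X_wthn_outcome = X - X_btw_outcome next to X = X_btw_study + X_wthn_study would have made the dependency visible before lm() ever saw it.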