Tags: r, dplyr, regression, linear-regression, lm

lm: 1 coefficient not defined because of singularities while no major collinearity issues


I have seen similar posts, like this one, which say that the message Coefficients: (1 not defined because of singularities) is caused by nearly perfect correlation among the predictors used in the lm() call.

But in my case there is no nearly perfect correlation among the predictors, yet one coefficient (X_wthn_outcome) still comes back as NA in the output of lm().

What is wrong with the coefficient that returns NA?

For reproducibility, the exact data and code are provided below.

library(dplyr)

set.seed(132)
(data <- expand.grid(study = 1:1e3, outcome = rep(1:50,2)))
data$X <- rnorm(nrow(data))
e <- rnorm(nrow(data), 0, 2)
data$yi <- .8 +.6*data$X + e

dat <- data %>% 
  group_by(study) %>% 
  mutate(X_btw_study = mean(X), X_wthn_study = X-X_btw_study) %>%
  group_by(outcome, .add = TRUE) %>%
  mutate(X_btw_outcome = mean(X), X_wthn_outcome = X-X_btw_outcome) %>% ungroup()
  
round(cor(select(dat,-study,-outcome,-X,-yi)),3)

#               X_btw_study X_wthn_study X_btw_outcome X_wthn_outcome
#X_btw_study          1.000        0.000         0.141           0.00
#X_wthn_study         0.000        1.000         0.698           0.71
#X_btw_outcome        0.141        0.698         1.000           0.00
#X_wthn_outcome       0.000        0.710         0.000           1.00

summary(lm(yi ~ 0 + X_btw_study + X_btw_outcome + X_wthn_study
   + X_wthn_outcome, data = dat))

#Coefficients: (1 not defined because of singularities)
#               Estimate Std. Error t value Pr(>|t|)    
#X_btw_study    0.524093   0.069610   7.529 5.15e-14 ***
#X_btw_outcome  0.014557   0.013694   1.063    0.288    
#X_wthn_study   0.589517   0.009649  61.096  < 2e-16 ***
#X_wthn_outcome       NA         NA      NA       NA  ## What's wrong with this variable

Solution

  • You have constructed a problem where the three-way combination of X_btw_study + X_btw_outcome + X_wthn_study perfectly predicts X_wthn_outcome:

    lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat)
    #------------------
    Call:
    lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study, 
        data = dat)
    
    Coefficients:
      (Intercept)    X_btw_study  X_btw_outcome   X_wthn_study  
        1.165e-17      1.000e+00     -1.000e+00      1.000e+00  
    #--------------
    summary( lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat) )
    
    #---------------
    Call:
    lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study, 
        data = dat)
    
    Residuals:
           Min         1Q     Median         3Q        Max 
    -3.901e-14 -6.000e-17  0.000e+00  5.000e-17  3.195e-13 
    
    Coefficients:
                    Estimate Std. Error    t value Pr(>|t|)    
    (Intercept)    1.165e-17  3.242e-18  3.594e+00 0.000326 ***
    X_btw_study    1.000e+00  3.312e-17  3.020e+16  < 2e-16 ***
    X_btw_outcome -1.000e+00  6.515e-18 -1.535e+17  < 2e-16 ***
    X_wthn_study   1.000e+00  4.590e-18  2.178e+17  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 1.025e-15 on 99996 degrees of freedom
    Multiple R-squared:      1, Adjusted R-squared:      1 
    F-statistic: 1.582e+34 on 3 and 99996 DF,  p-value: < 2.2e-16
    

     You have an adjusted R^2 of 1 with three predictors. So this is multi-collinearity, just not two-way collinearity. (R has caught on to your tricks and won't let you get away with this dplyr game of "hide the dependencies".) I think the dependencies might have been more apparent if you had built the variables in a sequence of separate steps rather than in a piped chain.
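
     By construction X = X_btw_study + X_wthn_study and X_wthn_outcome = X - X_btw_outcome, so the dropped column is exactly X_btw_study + X_wthn_study - X_btw_outcome. A minimal sketch of how you could confirm that dependence directly on dat, using only base R (alias() is the usual way to ask a fitted lm which column it aliased):

    ## the dependence is exact up to floating point, so this should return TRUE
    with(dat, all.equal(X_wthn_outcome,
                        X_btw_study + X_wthn_study - X_btw_outcome))

    ## alias() should report X_wthn_outcome as a combination of the other three
    alias(lm(yi ~ 0 + X_btw_study + X_btw_outcome + X_wthn_study
       + X_wthn_outcome, data = dat))

    ## equivalently, the model matrix has 4 columns but only rank 3
    mm <- model.matrix(~ 0 + X_btw_study + X_btw_outcome + X_wthn_study
       + X_wthn_outcome, data = dat)
    c(columns = ncol(mm), rank = qr(mm)$rank)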