Tags: r, model, linear-regression, lm, dummy-variable

How to estimate an lm dummy regression while avoiding multicollinearity?


I have a problem with an lm regression on dummy variables. I want to find out whether the seasonal influence (seasonality) changes over time. I set up the following regression to do so:

AT.trendinseason.lm <- lm(DTR.detrended~0+dum.jan+dum.feb+dum.mar+dum.apr+dum.may+dum.jun+dum.jul+dum.aug+dum.sep+dum.oct+dum.nov+dum.dec+dum.jan*t+dum.feb*t+dum.mar*t+dum.apr*t+dum.may*t+dum.jun*t+dum.jul*t+dum.aug*t+dum.sep*t+dum.oct*t+dum.nov*t+dum.dec*t)

The output I get is the following:

summary(AT.trendinseason.lm)

Call:
lm(formula = DTR.detrended ~ 0 + dum.jan + dum.feb + dum.mar + 
    dum.apr + dum.may + dum.jun + dum.jul + dum.aug + dum.sep + 
    dum.oct + dum.nov + dum.dec + dum.jan * t + dum.feb * t + 
    dum.mar * t + dum.apr * t + dum.may * t + dum.jun * t + dum.jul * 
    t + dum.aug * t + dum.sep * t + dum.oct * t + dum.nov * t + 
    dum.dec * t)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.4047 -2.2737 -0.3229  2.0987 18.9906 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
dum.jan   -2.495e+00  1.121e-01 -22.262  < 2e-16 ***
dum.feb   -1.527e+00  1.176e-01 -12.983  < 2e-16 ***
dum.mar    2.493e-01  1.124e-01   2.218 0.026552 *  
dum.apr    1.266e+00  1.144e-01  11.073  < 2e-16 ***
dum.may    1.785e+00  1.127e-01  15.844  < 2e-16 ***
dum.jun    1.597e+00  1.147e-01  13.926  < 2e-16 ***
dum.jul    1.882e+00  1.131e-01  16.640  < 2e-16 ***
dum.aug    1.544e+00  1.126e-01  13.721  < 2e-16 ***
dum.sep    1.335e+00  1.134e-01  11.780  < 2e-16 ***
dum.oct    8.306e-02  1.117e-01   0.744 0.456961    
dum.nov   -2.545e+00  1.137e-01 -22.390  < 2e-16 ***
dum.dec   -3.101e+00  1.119e-01 -27.703  < 2e-16 ***
t         -1.343e-05  5.431e-06  -2.473 0.013389 *  
dum.jan:t -8.571e-06  7.681e-06  -1.116 0.264444    
dum.feb:t -3.094e-06  7.866e-06  -0.393 0.694090    
dum.mar:t  5.346e-06  7.681e-06   0.696 0.486406    
dum.apr:t  3.850e-05  7.744e-06   4.971 6.69e-07 ***
dum.may:t  2.748e-05  7.681e-06   3.578 0.000346 ***
dum.jun:t  2.959e-05  7.744e-06   3.821 0.000133 ***
dum.jul:t  3.384e-05  7.698e-06   4.396 1.10e-05 ***
dum.aug:t  4.494e-05  7.711e-06   5.828 5.67e-09 ***
dum.sep:t -1.921e-06  7.744e-06  -0.248 0.804105    
dum.oct:t -1.526e-05  7.681e-06  -1.987 0.046943 *  
dum.nov:t  8.864e-07  7.744e-06   0.114 0.908876    
dum.dec:t         NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.093 on 35745 degrees of freedom
Multiple R-squared:  0.3145,    Adjusted R-squared:  0.314 
F-statistic: 683.2 on 24 and 35745 DF,  p-value: < 2.2e-16

As far as I know, there shouldn't be a multicollinearity problem here, yet R still omits one of my variables (dum.dec:t). Is there a way to prevent R from doing so?
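One way to check whether the design matrix really is rank-deficient is to look at how R expands the formula and then inspect the model matrix. The following is only a sketch on made-up toy data: three "months" instead of twelve, and the names d1, d2, d3, y are hypothetical stand-ins for the dum.* variables and DTR.detrended.

```r
# Toy data (hypothetical): 3 periods instead of 12 months.
set.seed(1)
n <- 300
month <- rep(1:3, length.out = n)
t <- seq_len(n)
d1 <- as.numeric(month == 1)
d2 <- as.numeric(month == 2)
d3 <- as.numeric(month == 3)
y <- rnorm(n)

# In a formula, d1 * t is shorthand for d1 + t + d1:t,
# so a main effect for t ends up in the model.
f_star <- y ~ 0 + d1 + d2 + d3 + d1 * t + d2 * t + d3 * t
term_labels <- attr(terms(f_star), "term.labels")
print(term_labels)

# The dummies sum to 1, so the interaction columns sum to t exactly.
X <- model.matrix(f_star)
max_gap <- max(abs(X[, "d1:t"] + X[, "d2:t"] + X[, "d3:t"] - X[, "t"]))
print(max_gap)

# The rank is one less than the number of columns, which is why
# lm reports one coefficient as NA.
rank_X <- qr(X)$rank
print(c(columns = ncol(X), rank = rank_X))
```

If the printed term labels contain a plain t alongside every dummy-by-t interaction, the singularity is expected rather than a quirk of lm.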

The model I want to follow is from a paper I read and there it seemed to work out:

[Image: the model equation from the paper; not reproduced here]

This is the approach I want to take but it doesn't seem to work.

Please help.


Solution

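  • I solved the problem; it came down to how I wrote the interaction terms. In an R formula, dum.jan*t is shorthand for dum.jan + t + dum.jan:t, so the * version adds a main effect t to the model (it shows up as its own row in the coefficient table of the question). Because the twelve month dummies sum to 1, the twelve dum.X:t columns sum to t, so one column is redundant and lm reports dum.dec:t as NA. Replacing * with : keeps only the interaction terms and removes the singularity. The new code is: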

    AT.trendinseason.lm <- lm(DTR.detrended~0+dum.jan+dum.feb+dum.mar+dum.apr+dum.may+dum.jun+dum.jul+dum.aug+dum.sep+dum.oct+dum.nov+dum.dec+dum.jan:t+dum.feb:t+dum.mar:t+dum.apr:t+dum.may:t+dum.jun:t+dum.jul:t+dum.aug:t+dum.sep:t+dum.oct:t+dum.nov:t+dum.dec:t)
    
    

    Giving me the desired results:

    Call:
    lm(formula = DTR.detrended ~ 0 + dum.jan + dum.feb + dum.mar + 
        dum.apr + dum.may + dum.jun + dum.jul + dum.aug + dum.sep + 
        dum.oct + dum.nov + dum.dec + dum.jan:t + dum.feb:t + dum.mar:t + 
        dum.apr:t + dum.may:t + dum.jun:t + dum.jul:t + dum.aug:t + 
        dum.sep:t + dum.oct:t + dum.nov:t + dum.dec:t)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -9.4047 -2.2737 -0.3229  2.0987 18.9906 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    dum.jan   -2.495e+00  1.121e-01 -22.262  < 2e-16 ***
    dum.feb   -1.527e+00  1.176e-01 -12.983  < 2e-16 ***
    dum.mar    2.493e-01  1.124e-01   2.218 0.026552 *  
    dum.apr    1.266e+00  1.144e-01  11.073  < 2e-16 ***
    dum.may    1.785e+00  1.127e-01  15.844  < 2e-16 ***
    dum.jun    1.597e+00  1.147e-01  13.926  < 2e-16 ***
    dum.jul    1.882e+00  1.131e-01  16.640  < 2e-16 ***
    dum.aug    1.544e+00  1.126e-01  13.721  < 2e-16 ***
    dum.sep    1.335e+00  1.134e-01  11.780  < 2e-16 ***
    dum.oct    8.306e-02  1.117e-01   0.744 0.456961    
    dum.nov   -2.545e+00  1.137e-01 -22.390  < 2e-16 ***
    dum.dec   -3.101e+00  1.119e-01 -27.703  < 2e-16 ***
    dum.jan:t -2.200e-05  5.431e-06  -4.052 5.10e-05 ***
    dum.feb:t -1.653e-05  5.691e-06  -2.904 0.003685 ** 
    dum.mar:t -8.087e-06  5.431e-06  -1.489 0.136489    
    dum.apr:t  2.507e-05  5.521e-06   4.540 5.64e-06 ***
    dum.may:t  1.405e-05  5.431e-06   2.587 0.009688 ** 
    dum.jun:t  1.616e-05  5.521e-06   2.927 0.003422 ** 
    dum.jul:t  2.041e-05  5.455e-06   3.741 0.000184 ***
    dum.aug:t  3.150e-05  5.474e-06   5.755 8.73e-09 ***
    dum.sep:t -1.535e-05  5.521e-06  -2.781 0.005420 ** 
    dum.oct:t -2.869e-05  5.431e-06  -5.283 1.28e-07 ***
    dum.nov:t -1.255e-05  5.521e-06  -2.273 0.023056 *  
    dum.dec:t -1.343e-05  5.431e-06  -2.473 0.013389 *  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 3.093 on 35745 degrees of freedom
    Multiple R-squared:  0.3145,    Adjusted R-squared:  0.314 
    F-statistic: 683.2 on 24 and 35745 DF,  p-value: < 2.2e-16
    
    

    In any case you know one way to solve this problem now. I hope it helps someone.
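The same structure (one level and one time slope per month, with no separate main effect for t) can also be written with a single factor instead of twelve hand-made dummies. This is a minimal sketch on simulated data, not the original series; the names month and y and the simulated July trend are assumptions made purely for illustration.

```r
# Simulated monthly data (hypothetical): 30 years, 12 months each.
set.seed(42)
n <- 360
t <- seq_len(n)
month <- factor(rep(month.abb, length.out = n), levels = month.abb)
y <- rnorm(n) + 0.01 * t * (month == "Jul")

# 0 + month gives one intercept per month; month:t gives one slope per month.
# Since t is not included as a main effect, no column is redundant.
fit <- lm(y ~ 0 + month + month:t)

any_na <- anyNA(coef(fit))
n_coef <- length(coef(fit))
print(n_coef)
print(any_na)
```

Writing month * t here instead would reintroduce the main effect t, and R would then code the interaction with contrasts relative to a reference month rather than as twelve separate slopes.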