
GLM submodel testing in R: why are all statistics still the same after removing one continuous covariate?


I am doing submodel testing. The smaller model is nested within the bigger model; the bigger model has one extra continuous variable. I used the likelihood ratio test, and the result is quite strange: both models have the same statistics, such as residual deviance and df. I also found that the two models have the same estimated coefficients and std. errors. How is this possible?

summary(m2221)

Call:
glm(formula = clm ~ veh_age + veh_body + agecat + veh_value:veh_age + 
veh_value:area, family = "binomial", data = Car)

Deviance Residuals: 
Min       1Q   Median       3Q      Max  
-0.9245  -0.3939  -0.3683  -0.3437   2.7095  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -1.294118   0.382755  -3.381 0.000722 ***
veh_age2            0.051790   0.098463   0.526 0.598897    
veh_age3           -0.166801   0.094789  -1.760 0.078457 .  
veh_age4           -0.239862   0.096154  -2.495 0.012611 *  
veh_bodyCONVT      -2.184124   0.707884  -3.085 0.002033 ** 
veh_bodyCOUPE      -0.850675   0.393625  -2.161 0.030685 *  
veh_bodyHBACK      -1.105087   0.374134  -2.954 0.003140 ** 
veh_bodyHDTOP      -0.973472   0.383404  -2.539 0.011116 *  
veh_bodyMCARA      -0.649036   0.469407  -1.383 0.166765    
veh_bodyMIBUS      -1.295135   0.404691  -3.200 0.001373 ** 
veh_bodyPANVN      -0.903032   0.395295  -2.284 0.022345 *  
veh_bodyRDSTR      -1.108488   0.826541  -1.341 0.179883    
veh_bodySEDAN      -1.097931   0.373578  -2.939 0.003293 ** 
veh_bodySTNWG      -1.129122   0.373713  -3.021 0.002516 ** 
veh_bodyTRUCK      -1.156099   0.384088  -3.010 0.002613 ** 
veh_bodyUTE        -1.343958   0.377653  -3.559 0.000373 ***
agecat2            -0.198002   0.058382  -3.391 0.000695 ***
agecat3            -0.224492   0.056905  -3.945 7.98e-05 ***
agecat4            -0.253377   0.056774  -4.463 8.09e-06 ***
agecat5            -0.441906   0.063227  -6.989 2.76e-12 ***
agecat6            -0.447231   0.072292  -6.186 6.15e-10 ***
veh_age1:veh_value -0.000637   0.026387  -0.024 0.980740    
veh_age2:veh_value  0.035386   0.031465   1.125 0.260753    
veh_age3:veh_value  0.114485   0.036690   3.120 0.001806 ** 
veh_age4:veh_value  0.189866   0.057573   3.298 0.000974 ***
veh_value:areaB     0.044099   0.021550   2.046 0.040722 *  
veh_value:areaC     0.021892   0.019189   1.141 0.253931    
veh_value:areaD    -0.023616   0.024939  -0.947 0.343658    
veh_value:areaE    -0.013506   0.026886  -0.502 0.615415    
veh_value:areaF     0.057780   0.026602   2.172 0.029850 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 33767  on 67855  degrees of freedom
Residual deviance: 33592  on 67826  degrees of freedom
AIC: 33652

Number of Fisher Scoring iterations: 5

 summary(m222)

Call:
glm(formula = clm ~ veh_value + veh_age + veh_body + agecat + 
    veh_value:veh_age + veh_value:area, family = "binomial", 
    data = Car)

Deviance Residuals: 
Min       1Q   Median       3Q      Max  
-0.9245  -0.3939  -0.3683  -0.3437   2.7095  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -1.294118   0.382755  -3.381 0.000722 ***
veh_value          -0.000637   0.026387  -0.024 0.980740    
veh_age2            0.051790   0.098463   0.526 0.598897    
veh_age3           -0.166801   0.094789  -1.760 0.078457 .  
veh_age4           -0.239862   0.096154  -2.495 0.012611 *  
veh_bodyCONVT      -2.184124   0.707884  -3.085 0.002033 ** 
veh_bodyCOUPE      -0.850675   0.393625  -2.161 0.030685 *  
veh_bodyHBACK      -1.105087   0.374134  -2.954 0.003140 ** 
veh_bodyHDTOP      -0.973472   0.383404  -2.539 0.011116 *  
veh_bodyMCARA      -0.649036   0.469407  -1.383 0.166765    
veh_bodyMIBUS      -1.295135   0.404691  -3.200 0.001373 ** 
veh_bodyPANVN      -0.903032   0.395295  -2.284 0.022345 *  
veh_bodyRDSTR      -1.108488   0.826541  -1.341 0.179883    
veh_bodySEDAN      -1.097931   0.373578  -2.939 0.003293 ** 
veh_bodySTNWG      -1.129122   0.373713  -3.021 0.002516 ** 
veh_bodyTRUCK      -1.156099   0.384088  -3.010 0.002613 ** 
veh_bodyUTE        -1.343958   0.377653  -3.559 0.000373 ***
agecat2            -0.198002   0.058382  -3.391 0.000695 ***
agecat3            -0.224492   0.056905  -3.945 7.98e-05 ***
agecat4            -0.253377   0.056774  -4.463 8.09e-06 ***
agecat5            -0.441906   0.063227  -6.989 2.76e-12 ***
agecat6            -0.447231   0.072292  -6.186 6.15e-10 ***
veh_value:veh_age2  0.036023   0.034997   1.029 0.303331    
veh_value:veh_age3  0.115122   0.039476   2.916 0.003543 ** 
veh_value:veh_age4  0.190503   0.058691   3.246 0.001171 ** 
veh_value:areaB     0.044099   0.021550   2.046 0.040722 *  
veh_value:areaC     0.021892   0.019189   1.141 0.253931    
veh_value:areaD    -0.023616   0.024939  -0.947 0.343658    
veh_value:areaE    -0.013506   0.026886  -0.502 0.615415    
veh_value:areaF     0.057780   0.026602   2.172 0.029850 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 33767  on 67855  degrees of freedom
Residual deviance: 33592  on 67826  degrees of freedom
AIC: 33652

anova(m2221, m222, test = "LRT")
Analysis of Deviance Table

Model 1: clm ~ veh_age + veh_body + agecat + veh_value:veh_age + 
veh_value:area
Model 2: clm ~ veh_value + veh_age + veh_body + agecat + veh_value:veh_age + 
veh_value:area
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1     67826      33592                     
2     67826      33592  0        0

Solution

  • You have specified the same model in two different ways. To show this, I'll first explain a bit of what is going on under the hood, then I'll walk through the coefficients of your two models to show they are the same, and I'll end with some higher-level intuition.

    Different formula creating different interaction terms

    First, the only difference between your two models is that the second includes veh_value as a standalone predictor, whereas the first does not. However, veh_value is interacted with other predictors in both models.

    So let's consider a simple reproducible example and see what R does when we do this. I'll use model.matrix() to feed in two different formulas and view the continuous and dummy variables that R creates as a result.

    colnames(model.matrix(am ~ mpg + factor(cyl):mpg, mtcars))
    #[1] "(Intercept)"      "mpg"              "mpg:factor(cyl)6" "mpg:factor(cyl)8"
    
    colnames(model.matrix(am ~ factor(cyl):mpg, mtcars))
    #"(Intercept)"      "factor(cyl)4:mpg" "factor(cyl)6:mpg" "factor(cyl)8:mpg"
    

    Notice that in the first call I included the continuous predictor mpg, whereas I did not include it in the second call (a simpler version of what you are doing).

    Now note that the second model matrix contains an extra interaction column (factor(cyl)4:mpg) that the first does not. In other words, because mpg is not included in the model directly, every level of cyl gets its own interaction term.
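    To see that nothing is lost when the main effect is dropped, here is a small sketch using the built-in mtcars data (with lm() instead of glm() just to keep it simple; the same column-space reasoning applies to a binomial glm()). The two formulas span the same column space, so the fitted values are identical:

    ```r
    # Same model, two parameterizations (mtcars ships with R)
    m1 <- lm(am ~ factor(cyl):mpg, data = mtcars)        # no mpg main effect
    m2 <- lm(am ~ mpg + factor(cyl):mpg, data = mtcars)  # with mpg main effect

    # Identical fits: the design matrices span the same column space
    isTRUE(all.equal(unname(fitted(m1)), unname(fitted(m2))))  # TRUE
    ```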

    Your models are the same

    Your models are doing the same thing as the simple example above, and we can show that the coefficients, once added together, are actually the same.

    In your first model, all 4 levels of veh_age appear in the interaction with veh_value because veh_value is not included as a standalone predictor.

    veh_age1:veh_value -0.000637   0.026387  -0.024 0.980740    
    veh_age2:veh_value  0.035386   0.031465   1.125 0.260753    
    veh_age3:veh_value  0.114485   0.036690   3.120 0.001806 ** 
    veh_age4:veh_value  0.189866   0.057573   3.298 0.000974 ***
    

    In your second model, only 3 levels of veh_age appear in the interaction with veh_value because veh_value is included as a standalone predictor.

    veh_value:veh_age2  0.036023   0.034997   1.029 0.303331    
    veh_value:veh_age3  0.115122   0.039476   2.916 0.003543 ** 
    veh_value:veh_age4  0.190503   0.058691   3.246 0.001171 **
    

    Now, here is the critical piece for seeing that the models are actually the same. It's easiest to show by walking through the first two levels of veh_age (the remaining levels work the same way).

    First consider veh_age = 1

    For both models, the coefficient on veh_value when veh_age = 1 is -0.000637.

    # For first model
    veh_age1:veh_value -0.000637   0.026387  -0.024 0.980740    
    
    # For second model
    veh_value          -0.000637   0.026387  -0.024 0.980740 
    

    Now consider veh_age = 2

    For both models, the coefficient on veh_value when veh_age = 2 is 0.035386.

    # For first model
    veh_age2:veh_value  0.035386   0.031465   1.125 0.260753    
    
    # For second model - note sum of the two below is 0.035386
    veh_value          -0.000637   0.026387  -0.024 0.980740
    veh_value:veh_age2  0.036023   0.034997   1.029 0.303331
    
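    The same bookkeeping can be checked mechanically in the mtcars sketch from above (again an illustration with built-in data, not the Car data): the slope for one level in the first parameterization equals the baseline slope plus the interaction offset in the second.

    ```r
    m1 <- lm(am ~ factor(cyl):mpg, data = mtcars)        # separate slope per cyl level
    m2 <- lm(am ~ mpg + factor(cyl):mpg, data = mtcars)  # baseline slope + offsets

    # Slope for cyl == 6 in the first parameterization ...
    s1 <- unname(coef(m1)["factor(cyl)6:mpg"])
    # ... equals the baseline slope plus the cyl == 6 offset in the second
    s2 <- unname(coef(m2)["mpg"] + coef(m2)["mpg:factor(cyl)6"])
    isTRUE(all.equal(s1, s2))  # TRUE
    ```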

    Intuition

    When you include the interaction veh_value:veh_age, you are saying that you want the coefficient of veh_value, a continuous variable, to be conditioned on veh_age, a categorical variable. Including both veh_value and the interaction veh_value:veh_age as predictors says the same thing: you want the coefficient of veh_value, conditioned on the value of veh_age. The two formulas are just different parameterizations of the same model, which is why every statistic in your two summaries matches.
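    This is also why the likelihood-ratio test in the question comes back empty: with identical column spaces there is nothing left to test. In the mtcars sketch, anova() likewise reports zero extra degrees of freedom between the two parameterizations:

    ```r
    m1 <- lm(am ~ factor(cyl):mpg, data = mtcars)
    m2 <- lm(am ~ mpg + factor(cyl):mpg, data = mtcars)

    # 0 extra degrees of freedom: the "bigger" model adds nothing
    anova(m1, m2)$Df[2]  # 0
    ```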