Tags: r, regression, glm, cross-validation, mse

MSE and cross validation score drastically differ from residuals in scale for a GLM


I have encountered something that I believe will be useful to others fitting and analyzing GLMs in R. The response in my dataset is frequency (count) data, and the set contains 1762 observations. I have fitted a negative binomial model (named nb1) with glm.nb, and I wish to estimate how well the model predicts the data.
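For reference, the fit looks roughly like this (x stands in for my actual predictors, and dfg is the data frame holding the response y):

library(MASS)                     # glm.nb comes from MASS
nb1 <- glm.nb(y ~ x, data = dfg)  # placeholder formula; the real model has more predictors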

For starters, when I apply residuals.glm (the residuals command gives the same result) I get

head(residuals.glm(nb1))
     1          2          3          4          5          6 
-1.1630170  2.9662854  2.0234981  0.1104864 -0.6636815  0.5303713 

which is reasonable and is in line with the diagnostic graphs.

This is where it becomes confusing. When calculating residuals manually I get

head(y - fitted(nb1))
      1           2           3           4           5           6 
-35.4970139  28.2611731  10.0475912   0.2914508 -10.0584696   2.4523959  

Calculating the MSE with the command residuals I get

mean(residuals(nb1)^2)
[1] 1.061085

while calculating the MSE manually I get

mean((y - fitted(nb1))^2)
[1] 4138.733

which is basically the same value I get from LOOCV (leave-one-out cross-validation):

library(boot)
loocvnb <- cv.glm(dfg, nb1, K = 1764)
loocvnb$delta
[1] 4352.700 4352.614

The default cost function behind the delta values returned by cv.glm is the mean squared error.
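As far as I can tell from the boot documentation, that default cost is the average squared error, so spelling it out explicitly should give the same delta values:

# explicit cost function -- this is cv.glm's documented default
mse <- function(y, yhat) mean((y - yhat)^2)
loocvnb <- cv.glm(dfg, nb1, cost = mse, K = 1764)
loocvnb$delta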

Why are the manually computed MSE and the LOOCV estimate so drastically different from the MSE based on the residuals function?


Solution

  • The residuals returned by residuals.glm are deviance residuals by default. When you compute y - fitted(nb1), you get the raw (response-scale) residuals instead. To obtain those from the accessor, use

    residuals.glm(nb1, type = "response")
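
    A quick sanity check (a sketch, assuming y is the response vector used above) confirms that response residuals reproduce the manual computation and shows which MSE lives on which scale:

        # response residuals equal the raw residuals computed by hand
        all.equal(unname(residuals(nb1, type = "response")),
                  unname(y - fitted(nb1)))       # should be TRUE

        # MSE on the response scale -- comparable to cv.glm's delta (~4139)
        mean(residuals(nb1, type = "response")^2)

        # mean squared deviance residual -- equals deviance(nb1)/n (~1.06)
        mean(residuals(nb1, type = "deviance")^2)

    Deviance residuals are defined so that their squares sum to the model deviance, so their mean square is a goodness-of-fit quantity on the deviance scale, not a prediction error in the units of y. That is why it is orders of magnitude smaller than the raw MSE and the LOOCV estimate.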