I am fitting a glm model that includes a factor variable, but the training data does not contain all the levels that appear in the data I want to apply the model to. The unknown levels can simply be ignored: I don't care what the model predicts for them, as long as it treats the remaining levels exactly as it would if training and application data had the same levels.
Since having unknown factor levels in application data gives an error, I searched for a workaround and found a great one provided by @matt_k here: "Factor has new levels" error for variable I'm not using
Now appending a new level still gives a warning message:
In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
So I wanted to find out what exactly happens. I tried it on a very simple example, leaving out the cylinder 6 level in an mpg model with mtcars:
mtcars$cyl <- as.factor(mtcars$cyl)
model <- glm(formula = mpg ~ cyl, data = mtcars[mtcars$cyl != 6, ])
model$xlevels[["cyl"]] <- union(model$xlevels[["cyl"]], levels(mtcars$cyl))
mtcars$preds <- predict(model, newdata = mtcars)
head(mtcars,15)
giving me:
                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb    preds
Mazda RX4          21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 26.66364
Mazda RX4 Wag      21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 26.66364
Datsun 710         22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 26.66364
Hornet 4 Drive     21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 26.66364
Hornet Sportabout  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 15.10000
Valiant            18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1 26.66364
Duster 360         14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 15.10000
Merc 240D          24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 26.66364
Merc 230           22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 26.66364
Merc 280           19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4 26.66364
Merc 280C          17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4 26.66364
Merc 450SE         16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3 15.10000
Merc 450SL         17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3 15.10000
Merc 450SLC        15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3 15.10000
Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 15.10000
To me it looks like the model just picks the coefficients of another factor level (in this case the ones for cyl = 4) to make a prediction for cyl = 6.
Since this would be absolutely fine for me, I would appreciate if someone could confirm that this is in fact what happens.
Let's start by looking at the model coefficients for the partial dataset, using summary(model):
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   26.664      1.068  24.966  < 2e-16 ***
cyl8         -11.564      1.427  -8.102 3.45e-08 ***
Predictions for cyl8 are equal to the intercept plus the effect of cyl8, so 26.664 + (-11.564) = 15.10. For the reference level, cyl4, the predictions are equal to the intercept (26.664). A factor level that was unknown at fitting time yields that same intercept-only prediction: R has no coefficient for it, because the level was excluded from the original model, so it adds nothing on top of the intercept.
We can see that the estimates for the known factor levels are unaffected when the model is estimated on the full data:
model2 <- glm(formula = mpg ~ cyl, data = mtcars)
summary(model2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  26.6636     0.9718  27.437  < 2e-16 ***
cyl6         -6.9208     1.5583  -4.441 0.000119 ***
cyl8        -11.5636     1.2986  -8.905 8.57e-10 ***
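You can see that the estimates for cyl8 and for the reference category cyl4 (the intercept) are unchanged, so the predictions for these levels are still 15.10 and 26.66 respectively. This holds here because, with a single factor as the only predictor, each prediction is simply the group mean. However, the first model's predictions for cyl6 are overestimated by 6.92, as you can see from the newly estimated cyl6 coefficient: the full model predicts 26.66 - 6.92 = 19.74 instead of 26.66.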
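To see the size of that effect row by row, the two sets of predictions can be put side by side. This is only a sketch: dat, partial, full, seen and cmp are illustrative names, not part of the original post. The last step sets predictions for levels the partial model never saw to NA, which is one possible way to keep them from being mistaken for real estimates when they do not matter anyway.

```r
dat <- mtcars
dat$cyl <- as.factor(dat$cyl)

partial <- glm(mpg ~ cyl, data = dat[dat$cyl != 6, ])
full    <- glm(mpg ~ cyl, data = dat)

# Remember which levels the partial model was actually trained on
seen <- partial$xlevels[["cyl"]]
partial$xlevels[["cyl"]] <- union(seen, levels(dat$cyl))

cmp <- data.frame(
  cyl          = dat$cyl,
  pred_partial = suppressWarnings(predict(partial, newdata = dat)),
  pred_full    = predict(full, newdata = dat)
)
cmp$diff <- cmp$pred_partial - cmp$pred_full

# Mean difference per level: approximately zero for 4 and 8,
# and equal to -coef(full)["cyl6"] for 6
aggregate(diff ~ cyl, data = cmp, FUN = mean)

# Optionally blank out predictions for levels unseen during fitting
cmp$pred_partial[!(cmp$cyl %in% seen)] <- NA
```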