
What does a glm model do with unknown factor levels added after training?


I am fitting a glm model that includes a factor variable, but the training data does not contain all the levels of that variable that occur in the data I want to apply the model to. The unknown levels can simply be ignored: I don't care what the model predicts for them, as long as it treats the remaining levels the same way in the training and application data.

Since having unknown factor levels in application data gives an error, I searched for a workaround and found a great one provided by @matt_k here: "Factor has new levels" error for variable I'm not using

Now appending a new level still gives a warning message:

In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading

So I wanted to find out what exactly happens. I tried it on a very simple example, leaving out the cyl = 6 level when fitting an mpg model on mtcars:

mtcars$cyl <- as.factor(mtcars$cyl)
model <- glm(formula = mpg ~ cyl, data = mtcars[mtcars$cyl !=6,])
model$xlevels[["cyl"]] <- union(model$xlevels[["cyl"]], levels(mtcars$cyl))
mtcars$preds <- predict(model, newdata = mtcars)
head(mtcars,15)

giving me:

                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb    preds
Mazda RX4          21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 26.66364
Mazda RX4 Wag      21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 26.66364
Datsun 710         22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 26.66364
Hornet 4 Drive     21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 26.66364
Hornet Sportabout  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 15.10000
Valiant            18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1 26.66364
Duster 360         14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 15.10000
Merc 240D          24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 26.66364
Merc 230           22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 26.66364
Merc 280           19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4 26.66364
Merc 280C          17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4 26.66364
Merc 450SE         16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3 15.10000
Merc 450SL         17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3 15.10000
Merc 450SLC        15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3 15.10000
Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 15.10000

To me it looks like the model just picks the coefficients from another factor level (in this case, those for cyl = 4 are used to make a prediction for cyl = 6). Since this would be absolutely fine for me, I would appreciate it if someone could confirm that this is in fact what happens.


Solution

  • Let's start by looking at the model coefficients for the partial dataset, using summary(model):

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)   26.664      1.068  24.966  < 2e-16 ***
    cyl8         -11.564      1.427  -8.102 3.45e-08 ***
    

    Predictions for cyl8 are equal to the intercept plus the effect of cyl8, so 26.664 + (-11.564) = 15.10. For the reference level (cyl4), the prediction is equal to the intercept (26.664). A factor level that is unknown to the model gets the same prediction as the reference level, because R has no coefficient for it (it was excluded from the original fit), so only the intercept contributes. We can see that the estimates for the known levels are unaffected by estimating the model on the full data.
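
The arithmetic above can be checked directly. The following is a minimal sketch that rebuilds the predictions by hand from the two fitted coefficients and compares them with what predict() returns. The names dat, preds and by_hand are introduced here for illustration only, and the rank-deficiency warning is silenced just for this check.

```r
# Work on a copy so the built-in mtcars data set is left untouched.
dat <- mtcars
dat$cyl <- as.factor(dat$cyl)

# Fit without the cyl == 6 rows, as in the question.
model <- glm(mpg ~ cyl, data = dat[dat$cyl != 6, ])

# Append the missing level to the stored levels (the workaround from the question).
model$xlevels[["cyl"]] <- union(model$xlevels[["cyl"]], levels(dat$cyl))

# predict() warns about a rank-deficient fit; suppressed for this check only.
preds <- suppressWarnings(predict(model, newdata = dat))

# Predictions built by hand: intercept + cyl8 effect for cyl == 8,
# intercept alone for every other level (including the unseen cyl == 6).
b <- coef(model)
by_hand <- ifelse(dat$cyl == "8",
                  b[["(Intercept)"]] + b[["cyl8"]],
                  b[["(Intercept)"]])

# Compare the two vectors up to numerical tolerance.
all.equal(unname(preds), by_hand)
```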

    # Refit on the full data, including the cyl = 6 rows, for comparison.
    model2 <- glm(formula = mpg ~ cyl, data = mtcars)
    summary(model2)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  26.6636     0.9718  27.437  < 2e-16 ***
    cyl6         -6.9208     1.5583  -4.441 0.000119 ***
    cyl8        -11.5636     1.2986  -8.905 8.57e-10 ***
    

    The coefficient for cyl8 and the intercept (the reference category cyl4) are unchanged, so the predictions for these levels are still 15.10 and 26.66, respectively. This holds here because cyl is the only predictor, so each level is fitted by its own group mean. Predictions for cyl6 from the partial model, however, are overestimated by 6.92, as the newly estimated cyl6 coefficient shows.
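
The comparison between the two fits can also be done programmatically. This is a sketch under the same setup as in the question; partial, full and means are illustrative names, not objects from the original post. It checks that the shared coefficients agree and that the gap between the cyl = 4 and cyl = 6 group means matches the size of the cyl6 coefficient.

```r
# Work on a copy so the built-in mtcars data set is left untouched.
dat <- mtcars
dat$cyl <- as.factor(dat$cyl)

# Fit once without the cyl == 6 rows and once on all rows.
partial <- glm(mpg ~ cyl, data = dat[dat$cyl != 6, ])
full    <- glm(mpg ~ cyl, data = dat)

# Coefficients present in both fits should agree up to numerical tolerance.
all.equal(coef(partial), coef(full)[names(coef(partial))])

# Observed mean mpg per cylinder group, as a plain named vector.
means <- sapply(split(dat$mpg, dat$cyl), mean)

# The partial model predicts the cyl = 4 mean for cyl = 6 cars, so the
# overestimate is the difference between the two group means.
overestimate <- means[["4"]] - means[["6"]]
all.equal(overestimate, -coef(full)[["cyl6"]])
```

If the model contained further predictors, the shared coefficients would generally not be identical between the two fits, so this check is specific to the one-factor example.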