r regression logistic-regression predict ordinal

probit ordinal logistic regression with `MASS::polr`: How to make prediction on new data

I want to do ordinal regression in R, so I want to use the polr function from the MASS package. First I create a model like this:

model <- polr(labels ~ var1 + var2, Hess = TRUE)

Now I want to use the model to predict new cases. I thought that would simply be:

pred <- predict(model, data = c(newVar1, newVar2))

However it seems that predict is somehow predicting on the training set, not the new data. When my training set is 2000 examples, and my new data is 700 examples. I still get 2000 predicted labels.

So my question is: how do I use polr to make predictions on new data?

Solution

Sadly there is no documentation entry for predict.polr, otherwise you can simply read that for how to use predict correctly.

In R, only for few primitive model fitting functions like smooth.spline, predict expect a vector for newdata (this is reasonable as smooth.spline handles univariate regression). Generally, predict expects a data frame or list, whose names match variables specified in model formula or as shown in model frame (the "terms" attributes). If you fit a model:

labels ~ var1 + var2

then you should construct newdata:

predict(model, newdata = data.frame(var1 = newVar1, var2 = newVar2))

predict(model, newdata = list(var1 = newVar1, var2 = newVar2))

Note, it is newdata, not data for predict.

Since there is no documentation, it may be good if we look at:

args(MASS:::predict.polr)
#function (object, newdata, type = c("class", "probs"), ...)

and you can even check source code (not long):

MASS:::predict.polr

You will see in source code:

newdata <- as.data.frame(newdata)
m <- model.frame(Terms, newdata, na.action = function(x) x, 
       xlev = object$xlevels)

This explains why newdata should be passed as a data frame, and why variable names must match what is in Terms.

Here is a reproducible example:

library(MASS)
house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

## check model terms inside model frame
attr(terms(house.plr$model), "term.labels")
# [1] "Infl" "Type" "Cont"

When making prediction, these will not work:

## `data` ignored as no such argument
predict(house.plr, data = data.frame("Low", "Tower", "Low"))
## no_match in names 
predict(house.plr, newdata = data.frame("Low", "Tower", "Low"))

This works:

predict(house.plr, newdata = data.frame(Infl = "Low", Type = "Tower", Cont = "Low"))

#[1] Low
#Levels: Low Medium High