r, logistic-regression, glmnet

predict.glm() on blind test data


I'm using regularized logistic regression for a classification problem via the glmnet package. During development everything works fine, but I run into a problem when making predictions on blind test data.

Because I don't know the class labels, my test data frame has one column fewer than the one I used for training. This seems to be a problem for predict(), because it expects matching dimensions. I can "fix" it by adding a column with some arbitrary labels to the test data, but that seems like a bad idea. I hope this example will illustrate the problem:

library(glmnet)
example = data.frame(rnorm(20))
colnames(example) = "A"
example$B = rnorm(20)
example$class = ((example$A + example$B) > 0)*1

testframe = data.frame(rnorm(20))
colnames(testframe) = "A"
testframe$B = rnorm(20)

x = model.matrix(class ~ ., data = example)
y = data.matrix(example$class)

# this is similar to the situation I have with my data
# the class labels are omitted on the blind test set

So if I just proceed like this, I get an error:

x.test = as.matrix(testframe)
ridge = glmnet(x,y, alpha = 0, family = "binomial", lambda = 0.01789997)
ridge.pred = predict(ridge, newx = x.test, s = 0.01789997, type = "class")

Error in cbind2(1, newx) %*% nbeta: Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90

I can "fix" the problem by adding a class column to my test data:

testframe$class = 0
x.test = model.matrix(class ~ ., data = testframe)
ridge.pred2 = predict(ridge, newx = x.test, s = 0.01789997, type = "class")

So I have a couple of questions about this:
a) Is this workaround of adding a column safe to do? It feels very wrong/dangerous, because I don't know whether the predict method will actually use it (why would it require this column to be there otherwise?).
b) What's a better / "the correct" way to do this?

Thanks in advance!


Solution


    When you create the matrix x, drop the (Intercept) column (which is always the first column). Then your predict function will work without the workaround. Specifically, use this line to create x.

    x = model.matrix(class ~ ., data = example)[,-1]
    
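    For completeness, here is a minimal sketch of the full corrected workflow, reusing the example, testframe, and lambda value from the question; note that no class column needs to be added to the test data:

    library(glmnet)

    # drop the (Intercept) column so x contains only the predictors "A" and "B"
    x = model.matrix(class ~ ., data = example)[,-1]
    y = example$class

    # the test matrix now has exactly the same columns as x
    x.test = as.matrix(testframe)

    ridge = glmnet(x, y, alpha = 0, family = "binomial", lambda = 0.01789997)
    ridge.pred = predict(ridge, newx = x.test, s = 0.01789997, type = "class")
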

    Explanation

    You are getting an error because model.matrix creates a column for the intercept in the model, and that column is not in your x.test matrix.

    colnames(x)
    # [1] "(Intercept)" "A"           "B"          
    colnames(x.test)
    # [1] "A" "B"
    

    Unless you set intercept=FALSE, glmnet will add an intercept to the model for you. Thus, the simplest thing to do is exclude the intercept column from both the x and x.test matrices.
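
    If you want to confirm that an intercept is still being estimated after dropping the column, a quick check (a sketch using the ridge fit from the corrected code above) is to inspect the coefficients; glmnet reports an (Intercept) row even though x no longer contains an intercept column:

    coef(ridge, s = 0.01789997)
    # 3 x 1 sparse Matrix with rows "(Intercept)", "A" and "B"
    # (the actual values depend on the simulated data)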