I'm using regularized logistic regression from the glmnet package for a classification problem. During development everything works fine, but I run into a problem when making predictions on blind test data.
Because I don't know the class labels, my test data frame has one column fewer than the one I used for training. This seems to be a problem for predict.glmnet(), because it expects matching dimensions. I can "fix" it by adding a column of arbitrary labels to the test data, but that seems like a bad idea. I hope this example illustrates the problem:
library(glmnet)
example = data.frame(rnorm(20))
colnames(example) = "A"
example$B = rnorm(20)
example$class = ((example$A + example$B) > 0)*1
testframe = data.frame(rnorm(20))
colnames(testframe) = "A"
testframe$B = rnorm(20)
x = model.matrix(class ~ ., data = example)
y = data.matrix(example$class)
# this is similar to the situation I have with my data
# the class labels are omitted on the blind test set
So if I just proceed like this, I get an error:
x.test = as.matrix(testframe)
ridge = glmnet(x,y, alpha = 0, family = "binomial", lambda = 0.01789997)
ridge.pred = predict(ridge, newx = x.test, s = 0.01789997, type = "class")
Error in cbind2(1, newx) %*% nbeta: Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90
I can "fix" the problem by adding a class column to my test data:
testframe$class = 0
x.test = model.matrix(class ~ ., data = testframe)
ridge.pred2 = predict(ridge, newx = x.test, s = 0.01789997, type = "class")
So I have a couple of questions about this:
a) Is this workaround of adding a column safe? It feels wrong/dangerous, because I don't know whether the predict method will actually use it (why would it require this column to be there otherwise?).
b) What's a better / "the correct" way to do this?
Thanks in advance!
Answer
When you create the matrix x, drop the (Intercept) column (which is always the first column). Then your predict call will work without the workaround. Specifically, use this line to create x:
x = model.matrix(class ~ ., data = example)[,-1]
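For completeness, here is a minimal sketch of the full corrected workflow, reusing the example data and lambda from the question (note that no dummy class column is needed on the test set):
x = model.matrix(class ~ ., data = example)[,-1]  # drop the (Intercept) column
y = example$class
x.test = as.matrix(testframe)                     # columns A and B only, matching x
ridge = glmnet(x, y, alpha = 0, family = "binomial", lambda = 0.01789997)
ridge.pred = predict(ridge, newx = x.test, s = 0.01789997, type = "class")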
Explanation
You are getting an error because model.matrix creates a column for the intercept in the model, which is not in your x.test matrix.
colnames(x)
# [1] "(Intercept)" "A" "B"
colnames(x.test)
# [1] "A" "B"
Unless you set intercept = FALSE, glmnet adds an intercept to the model for you. Thus, the simplest thing to do is to exclude the intercept column from both the x and x.test matrices.
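If you prefer to build both matrices with model.matrix, a sketch of one option (assuming the blind test frame simply has no class column) is to use a formula without a response on the test data and drop the intercept from each:
x = model.matrix(class ~ ., data = example)[,-1]
x.test = model.matrix(~ ., data = testframe)[,-1]  # no dummy class column required
colnames(x)       # [1] "A" "B"
colnames(x.test)  # [1] "A" "B"
Either way, the key point is that the columns of x.test line up exactly with the columns of x that glmnet was trained on.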