Search code examples
rglmnet

Why is predict.glmnet not predicting probabilities?


I'm working on a model to predict the probability that college baseball players will make the major leagues. My dataset has 633 observations and 13 predictors with a binary response. The code below generates smaller reproducible examples of training and testing datasets:

set.seed(1)
OBP <- rnorm(50, mean=1, sd=.2)
HR.PCT <- rnorm(50, mean=1, sd=.2)
AGE <- rnorm(50, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=50, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
df.train <- data.frame(OBP, HR.PCT, AGE, CONF)
df.train <- df.train[order(-OBP),]
df.train$MADE.MAJORS <- 0
df.train$MADE.MAJORS[1:10] <- 1

OBP <- rnorm(10, mean=1, sd=.2)
HR.PCT <- rnorm(10, mean=1, sd=.2)
AGE <- rnorm(10, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=10, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
MADE.MAJORS <- sample(0:1, size=10, replace=TRUE, prob=c(0.8,0.2))
df.test <- data.frame(OBP, HR.PCT, AGE, CONF, MADE.MAJORS)

I then used glmnet to perform the lasso with logistic regression and generate predictions. I want the predictions to be in the form of probabilities (that is, between 0 and 1).

library(glmnet)
train.mtx <- with(df.train, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
glmmod <- glmnet(x=train.mtx, y=as.factor(df.train$MADE.MAJORS), alpha=1, family="binomial")
cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1)

test.mtx <- with(df.test, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
preds <- predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="response")
cv.preds <- predict.cv.glmnet(object=cv.glmmod, newx=test.mtx, s="lambda.min")

Here are the predictions:

> preds
            1
1  -3.2589440
2  -0.4435265
3   3.9646670
4   0.3772816
5   0.9952887
6  -7.3555661
7   0.2283675
8  -2.3871317
9  -8.1632749
10 -1.3563051

> cv.preds
            1
1   0.1568839
2   0.3630938
3   0.7435941
4   0.4808428
5   0.5261076
6  -0.1431655
7   0.4123054
8   0.2207381
9  -0.1446941
10  0.2962391

I have a few questions about these results. Feel free to answer any or all (or none) of them. I'm most interested in an answer for the first question.

  1. Why are the predictions from predict.glmnet (the preds vector) not in the form of probabilities? I put the preds values through the inverse logit function and got reasonable probabilities. Was that correct?

  2. The predictions from predict.cv.glmnet (the cv.preds vector) mostly look like probabilities, but some of them are negative. Why is this?

  3. When I use the glmnet function to create the glmmod object, I include the family="binomial" argument to indicate that I'm using logistic regression. However, when I use the cv.glmnet function to find the best value for lambda, I'm not able to specify logistic regression. Am I actually getting the best value for lambda if the cross-validation doesn't use logistic regression?

  4. Similarly, when I use the predict.cv.glmnet function, I'm not able to specify logistic regression. Does this function produce the predictions that I want?


Solution

  • I am not 100% sure on the following because the package does seem to operate counter to its documentation, as you've noticed, but it may produce some indication whether your thinking is along the right path.

    Question 1

    Yes, you're right. Note that,

    > predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="link")
                1
    1  -3.2589440
    2  -0.4435265
    3   3.9646670
    4   0.3772816
    5   0.9952887
    6  -7.3555661
    7   0.2283675
    8  -2.3871317
    9  -8.1632749
    10 -1.3563051
    

    which is the same output as type="response". Thus, putting it through the inverse logit function would be the right way to get the probabilities. As to why is this happening, I have not clue -perhaps a bug.

    Question 2...4

    For the cv.preds, you're getting something along the lines of probabilities because you're fitting a Gaussian link. In order to fit a logit link, you should specify the family parameter. Namely:

    cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1, family="binomial")
    
    > cv.preds
                1
    1  -10.873290
    2    1.299113
    3   15.812671
    4    3.622259
    5    5.621857
    6  -24.826551
    7    1.734000
    8   -5.420878
    9  -26.160403
    10  -4.496020
    

    In this case, cv.preds will output along the real line and you can put those values through the inverse logit to get the probabilities.