
Why are the predictions from poisson lasso regression model in glmnet not integers?


I am conducting a lasso regression modeling predictors of a count outcome in glmnet.

I am wondering what to make of the predictions from this model.

Here is some toy data. It's not very good because I don't know how to simulate multivariate data but I'm mainly interested in whether I'm getting the syntax right.

set.seed(123)
df <- data.frame(count = rpois(500, lambda = 3),
                 pred1 = rnorm(500),
                 pred2 = rnorm(500),
                 pred3 = rnorm(500),
                 pred4 = rnorm(500),
                 pred5 = rnorm(500),
                 pred6 = rnorm(500),
                 pred7 = rnorm(500),
                 pred8 = rnorm(500),
                 pred9 = rnorm(500),
                 pred10 = rnorm(500))

Now run the model

library(glmnet)

# Design matrix without the intercept column; glmnet adds its own intercept
x <- model.matrix(count ~ ., df)[,-1]
y <- df$count
cvg <- cv.glmnet(x,y,family = "poisson")

Now when I generate predicted outcomes

yTest <- predict(cvg, newx = x, family = "poisson", type = "link")

This is the output

# 1   1.094604
# 2   1.094604
# 3   1.094604
# 4   1.094604
# 5   1.094604
# 6   1.094604
# ... ........

Now obviously the model predictions are all the same and all terrible (unsurprising given the absence of any association between the predictors and the outcome), but the thing I am wondering is why they are not integers (with my real data I have the same problem).

My questions are:

Am I specifying the correct arguments in the predict() function for glmnet objects? The help for the predict function states that specifying type = "link" gives "the linear predictors" for poisson models, whereas specifying type = "response" gives the "fitted mean" for poisson models (in the case of my dumb example it generates 500 values of 2.988).

Shouldn't the predicted outcomes match the form of the data itself, i.e. be integers?

If I am specifying the correct arguments in the predict() function, how do I use the non-integer predictions? Do I round them to the nearest integer, or just leave them alone?


Solution

  • Shouldn't the predicted outcomes match the form of the data itself, i.e. be integers?

    When you use a regression model you are associating a (conditional) probability distribution, indexed by parameters (in the Poisson case, the lambda parameter, which represents the mean), with each predictor configuration. A prediction of the response minimizes some expected loss function conditional on the predictor values, so it depends on which loss function you are using.

    If you consider a 0-1 loss, then yes, the predicted value should be an integer: the mode of the distribution, i.e. its most probable value, which for a Poisson distribution is the floor of lambda when lambda is not an integer (https://en.wikipedia.org/wiki/Poisson_distribution).
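As a quick check of the mode claim, here is a minimal base-R sketch. The value lambda = 2.988 is the fitted mean reported in the question; the grid 0:50 is an arbitrary choice that comfortably covers the probability mass.

```r
# Fitted Poisson mean reported in the question (type = "response")
lambda <- 2.988

# Probability of each count 0, 1, ..., 50 under Poisson(lambda)
counts <- 0:50
probs <- dpois(counts, lambda)

# The most probable count (the mode) is the prediction under 0-1 loss
mode_count <- counts[which.max(probs)]

mode_count      # most probable count
floor(lambda)   # floor of the mean, for comparison
```

The two printed values should agree whenever lambda is not an integer.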

    If you consider a squared loss (y - y_prediction)^2 then your prediction is the conditional expectation (see https://en.wikipedia.org/wiki/Minimum_mean_square_error#Properties), which is not necessarily an integer, just like the result you are getting.
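To see numerically that the mean minimizes the expected squared loss, here is a rough base-R sketch. It again assumes lambda = 2.988 from the question; the truncation at 200 and the candidate grid from 0 to 6 in steps of 0.001 are illustrative choices, not anything glmnet does.

```r
lambda <- 2.988

# Truncate the Poisson support; the mass beyond 200 is negligible here
counts <- 0:200
probs <- dpois(counts, lambda)

# Expected squared loss E[(Y - pred)^2] for a candidate prediction
expected_sq_loss <- function(pred) sum(probs * (counts - pred)^2)

# Evaluate the loss on a grid of candidate predictions
candidates <- seq(0, 6, by = 0.001)
losses <- vapply(candidates, expected_sq_loss, numeric(1))

# The minimizer should sit at (the grid point closest to) lambda
best <- candidates[which.min(losses)]
best
```

The minimizer is a non-integer value, which is why a mean-type prediction from a count model is not expected to be an integer.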

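    glmnet fits the model by penalized maximum likelihood, and predict() with type = "response" returns the estimated conditional mean, which is the optimal prediction under squared loss and so need not be an integer. With type = "link", as in your call, you get the linear predictor instead, i.e. the log of that mean: exp(1.094604) is approximately 2.988. If you want an integer prediction (the one that minimizes the 0-1 loss), apply floor() to the type = "response" values, not to the type = "link" values. If squared error is what you care about, leave the fitted means unrounded.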
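The relationship between the two prediction types can be sketched as follows. This assumes the glmnet package is installed and simulates toy data similar to (but not identical to) the data in the question, so the numbers will differ from those shown there. The names eta, mu and y_mode are just illustrative.

```r
library(glmnet)

# Toy data: 10 noise predictors and a Poisson outcome unrelated to them
set.seed(123)
n <- 500
x <- matrix(rnorm(n * 10), nrow = n,
            dimnames = list(NULL, paste0("pred", 1:10)))
y <- rpois(n, lambda = 3)

cvg <- cv.glmnet(x, y, family = "poisson")

# Linear predictor, i.e. the log of the fitted mean
# (predict() on a cv.glmnet object defaults to s = "lambda.1se")
eta <- predict(cvg, newx = x, type = "link")

# Fitted mean on the scale of the counts
mu <- predict(cvg, newx = x, type = "response")

# The two scales are connected through the log link
all.equal(exp(eta), mu)

# Integer prediction under 0-1 loss: the Poisson mode of each fitted mean
y_mode <- floor(mu)
head(cbind(link = eta[, 1], response = mu[, 1], mode = y_mode[, 1]))
```

Note that the family argument is not needed in predict(); the fitted object already carries it.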