Search code examples
rcross-validationglmnet

How to see the out of sample GLMNET LogReg results to build a confusion matrix?


I would like to get the confusion matrix from all out of sample predictions when using glmnet logistic regression model with LOOCV. The simple example below uses data frame with 10 observations. Is there a way to see what the result is for the individual 10 out of sample predictions?

Based on what I have red from the glmnet manual I think I have to use the option ‘keep = TRUE’ in the cv.glmnet call. But after that I am not sure what to do. For example, I am not sure if the last 4 lines in the code give me what I want as it prints too many confusion matrixes.

library(glmnet)
library(dplyr)


set.seed(123)
x1 <- c(1, 2, 3, 4, 4, 5, 5, 6, 7, 8)
x2 <- rnorm(10)
y <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
df <- data.frame(x1, x2, y)
plot(df)

# Find the best lambda for RIDGE with LOOCV
cv.ridge <- cv.glmnet(x=as.matrix(df[,-3]), y=y, alpha=0, family="binomial", 
                      nfolds=NROW(df), standardize = FALSE, keep = TRUE)

plot(cv.ridge)
cv.ridge$lambda.min
coef(cv.ridge, cv.ridge$lambda.min)


# Fit model with lambda.min
fit.ridge <- glmnet(x=as.matrix(df[,-3]), y=y, alpha=0,
                    family="binomial", lambda = cv.ridge$lambda.min, standardize = FALSE)

print(fit.ridge)

# Make prediction on the training dataset
probabilities <- fit.ridge %>% predict(newx = as.matrix(df[,-3]), standardize = FALSE)
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)

# Model classification accuracy on the training dataset
observed.classes <- y
CA_min <<- mean(predicted.classes == observed.classes)

# Confusion matrix on the training dataset
table(y, predicted.classes)

# Not sure what are the confusion martixes printed by the last two lines ???
cnf <- confusion.glmnet(cv.ridge$fit.preval, newy = y, family = "binomial")
best <- cv.ridge$index["min",]
print(cnf[[best]])
print(cnf)


Solution

  • If you look at cv.ridge$fit.preval :

    cv.ridge$fit.preval[,1:3]
                  s0         s1         s2
     [1,]  0.2231436  0.2231436  0.2231436
     [2,]  0.2231436  0.2231436  0.2204341
     [3,]  0.2231436  0.2212310  0.2207636
     [4,]  0.2226969  0.2224932  0.2224300
     [5,] -0.2238043 -0.2238687 -0.2239393
     [6,]  0.2234414  0.2234704  0.2235023
     [7,] -0.2228107 -0.2226589 -0.2226118
     [8,] -0.2231436 -0.2212451 -0.2207812
     [9,] -0.2231436 -0.2231436 -0.2200512
    [10,] -0.2231436 -0.2231436 -0.2231436
    

    This returns you a matrix of the predictions, in log odds ratio, every column represents a lambda value. To convert from log odds ratio to probability (check something like this notes):

    logodds_to_probability = function(x){
    return(exp(x)/(1+exp(x)))
    }
    

    Since you are interested in lambda min, then you can see we get back the same confusion matrix:

    ix = cv.ridge$index["min",]
    prob = logodds_to_probability(cv.ridge$fit.preval[,ix])
    pred = as.numeric(prob>0.5)
    
        y
    pred 0 1
       0 3 2
       1 2 3
    
    cnf[[best]]
             True
    Predicted 0 1 Total
        0     3 2     5
        1     2 3     5
        Total 5 5    10
    
     Percent Correct:  0.6