Tags: r, r-caret, glmnet

Extracting predictions from caret's train function


I am trying to reproduce missuse's working answer for pulling out predictions from caret's train function. I am using elastic net and just cannot get it to work.

Here is a reproducible example:

require(caret)   
require(glmnet)

x = matrix(rnorm(100 * 20), 100, 20)   
set.seed(3) 
g = sample(c(0,1), 100, replace = TRUE)

df = as.data.frame(x) 
g_f = as.factor(g) 
df$g_f = g_f

train_control <- trainControl(
  method = "cv",
  number = 3,
  savePredictions = T)

sorozat = seq(0, 1, 0.25)

search_grid <- expand.grid(
  alpha = sorozat,
  lambda = sorozat)

set.seed(3) 
fit2 <- train(g_f ~ .,
              data = df,
              trControl = train_control,
              tuneGrid = search_grid,
              preProc = c("BoxCox", "center", "scale"),
              method = "glmnet")

And my attempt, which gives an error:

prediction2 <- predict(fit2$finalModel,
                       data = predict(fit2$preProcess,
                                      df))$prediction

Error in predict.glmnet(fit2$finalModel, data = predict(fit2, df)) : You need to supply a value for 'newx'

Below is how I can get a prediction. But how can I be sure it's the right one, when its confusion matrix:

# CM ver.1
pred_f = predict(fit2, df)
cm = as.data.frame(pred_f)
cm$g = g_f
table(cm)
      g
pred_f  0  1
     0 29  9
     1 15 47

is different from the one provided by the model?

# CM ver.2
confusionMatrix(fit2)$table
          Reference
Prediction  0  1
         0 23 16
         1 21 40

Thanks in advance for any help!

Edit: Added the output of confusion matrices.


Solution

  • The linked answer does not work for glmnet since predict.glmnet has some peculiarities:

    the data argument to predict.glmnet is called newx and must be a matrix.

    Apart from that, this predict function uses all fitted lambda values to create predictions, so if you want only the best one you must specify it. Additionally, it is advisable to set the prediction type to your liking:

    Using your example, the optimal tuning values were alpha = 0.5 and lambda = 0.25. The alpha is already set inside the final model, but the lambda must be specified during prediction.
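
    These optimal values do not have to be hard-coded; they can be read from the fitted object (a small sketch: best_alpha and best_lambda are just throwaway names, and the exact numbers depend on your data and seed):

    fit2$bestTune                        # one-row data frame with the selected alpha and lambda
    best_alpha  <- fit2$bestTune$alpha   # already baked into fit2$finalModel
    best_lambda <- fit2$bestTune$lambda  # pass this as `s` to predict.glmnet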

    But first we must apply the same pre-processing to the data we want to predict on (same as in the linked answer):

    predict(fit2$preProcess, df)
    

    This, however, returns a data frame that still contains the class column, so in order to supply it to predict.glmnet the response column (a factor) must be removed and the data frame converted to a matrix:

    as.matrix(predict(fit2$preProcess, df)[,-21])
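
    If you would rather not rely on the response being the 21st column, the same matrix can be built by dropping the column by name (pre_df and x_pre are just throwaway names for this sketch):

    pre_df <- predict(fit2$preProcess, df)                 # pre-processed data frame, still contains g_f
    x_pre  <- as.matrix(pre_df[, names(pre_df) != "g_f"])  # predictor matrix ready for predict.glmnet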
    

    Now call predict.glmnet with the optimal lambda of 0.25, setting the prediction type to "class":

    library(glmnet)
    prediction2 <- predict(fit2$finalModel,
                           newx = as.matrix(predict(fit2$preProcess,
                                                    df)[, -21]),
                           type = "class",
                           s = 0.25)
    
    head(prediction2)
         1  
    [1,] "0"
    [2,] "1"
    [3,] "0"
    [4,] "0"
    [5,] "0"
    [6,] "0"
    

    EDIT: to answer the edited question about confusion matrix differences.

    When you call confusionMatrix on the output of train, the resulting matrix is obtained from the out-of-fold predictions made during resampling. It is less biased, since these are test-set predictions.

    When you fit a model on all of the data (this is fit2$finalModel) and use it to predict on that same data, you are creating train-set predictions. These carry a lot of bias, since the model was fit using these very observations. This is the reason the off-diagonal sum is much smaller here than when calling confusionMatrix on fit2. This is sometimes referred to as overfitting: the model predicts the data it has already seen much better.

    In short

    confusionMatrix(fit2)
    

    produces a confusion matrix from the out of fold predictions. This can be used as a metric for model selection.

    while

    confusionMatrix(as.factor(prediction2), g_f)
    

    produces a highly biased confusion matrix based on the model's predictions on the training data. This should not be used as a metric for model selection.

    EDIT 2: It just occurred to me that this might be an XY problem.

    If you just want the cross-validated predictions, you can simply use:

    fit2$pred
    

    If you want to calculate the AUC for these, you should specify that you want class probabilities in trainControl:

    train_control <- trainControl(
      method = "cv",
      number = 3,
      savePredictions = TRUE,
      classProbs = TRUE)
    

    One additional concern is that class levels need to be valid variable names, so numbers such as 0 and 1 won't work. An easy fix is:

    df$g_f <- factor(df$g_f,
                     levels = c(0, 1),
                     labels = c("zero", "one"))
    

    After the fit:

    set.seed(3) 
    fit2 <- train(g_f ~ .,    
                  data = df,    
                  trControl = train_control,    
                  tuneGrid = search_grid,   
                  preProc = c("BoxCox", "center", "scale"),   
                  method = "glmnet")
    

    the predictions are in fit2$pred:

    head(fit2$pred)
    #output
      pred  obs rowIndex      zero       one alpha lambda Resample
    1  one  one        2 0.4513397 0.5486603     0      1    Fold1
    2 zero zero        4 0.5764889 0.4235111     0      1    Fold1
    3 zero  one        5 0.5154925 0.4845075     0      1    Fold1
    4  one  one        6 0.4836418 0.5163582     0      1    Fold1
    5 zero zero        7 0.5199623 0.4800377     0      1    Fold1
    6  one zero        8 0.4770536 0.5229464     0      1    Fold1
    

    These predictions are for all tested hyperparameter combinations. To get just the ones for the best-performing hyperparameters:

    library(tidyverse)
    
    fit2$pred %>%
      filter(alpha == fit2$bestTune$alpha &
             lambda == fit2$bestTune$lambda) -> best_preds
    

    There are two approaches to obtain a metric from these predictions.

    Approach 1: you can do it on the combined fold predictions (less common, but useful when you have small data sets and the per-fold performance therefore has high variance):

    pROC::roc(best_preds$obs, best_preds$one)$auc
    #output
    Area under the curve: 0.6631
    

    Approach 2: you can calculate it per fold and then average (much more common, and what caret does internally for any metric; a sketch of asking caret to compute it directly is given at the end):

    library(tidyverse)
    
    best_preds %>%
      group_by(Resample) %>%
      summarise(auc = as.numeric(pROC::roc(obs, one)$auc))
    #output
      Resample   auc
      <chr>    <dbl>
    1 Fold1    0.592
    2 Fold2    0.757
    3 Fold3    0.614
    

    The above is the AUC per fold.

    To average it:

    best_preds %>%
      group_by(Resample) %>%
      summarise(auc = as.numeric(pROC::roc(obs, one)$auc)) %>%
      ungroup() %>%
      summarise(mean_auc = mean(auc))
    #output
      mean_auc
         <dbl>
    1    0.654
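
    For completeness: caret can compute this fold-averaged ROC AUC for you during training if you request it via summaryFunction = twoClassSummary together with metric = "ROC". A sketch under the same setup as above (train_control_roc and fit2_roc are just names used for this sketch):

    train_control_roc <- trainControl(
      method = "cv",
      number = 3,
      savePredictions = TRUE,
      classProbs = TRUE,
      summaryFunction = twoClassSummary)

    set.seed(3)
    fit2_roc <- train(g_f ~ .,
                      data = df,
                      trControl = train_control_roc,
                      tuneGrid = search_grid,
                      preProc = c("BoxCox", "center", "scale"),
                      metric = "ROC",
                      method = "glmnet")

    fit2_roc$results    # fold-averaged ROC (plus Sens and Spec) for every alpha/lambda combination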