I am trying to reproduce missuse’s working answer for pulling out predictions from caret’s train function. I am using elastic net and just cannot get it to work.
Here is a reproducible example:
require(caret)
require(glmnet)
x = matrix(rnorm(100 * 20), 100, 20)
set.seed(3)
g = sample(c(0,1), 100, replace = TRUE)
df = as.data.frame(x)
g_f = as.factor(g)
df$g_f = g_f
train_control <- trainControl(
method="cv",
number = 3,
savePredictions = T)
sorozat = seq(0, 1, 0.25)
search_grid <- expand.grid(
alpha = sorozat,
lambda = sorozat )
set.seed(3)
fit2 <- train(g_f ~ .,
data = df,
trControl = train_control,
tuneGrid = search_grid,
preProc = c("BoxCox", "center", "scale"),
method = "glmnet")
And my attempt, which gives an error:
prediction2 <- predict(fit2$finalModel,
data = predict(fit2$preProcess,
df))$prediction
Error in predict.glmnet(fit2$finalModel, data = predict(fit2, df)) : You need to supply a value for 'newx'
Below is how I can get a prediction. But how can I be sure it is the right one, when its confusion matrix:
# CM ver.1
pred_f = predict(fit2, df)
cm = as.data.frame(pred_f)
cm$g = g_f
table(cm)
g
pred_f 0 1
0 29 9
1 15 47
is different from the one provided by the model?
# CM ver.2
confusionMatrix(fit2)$table
Reference
Prediction 0 1
0 23 16
1 21 40
Thanks in advance for any help!
Edit: Added the output of confusion matrices.
The linked answer does not work for glmnet since predict.glmnet has some peculiarities: the data argument to predict.glmnet is called newx and it must be a matrix.
Apart from that, this predict function creates predictions for all fitted lambda values, so if you want just the best one you must specify it. Additionally, it is advisable to set the prediction type to your liking.
Using your example, the optimal tuned values were alpha = 0.5 and lambda = 0.25. The alpha is already set inside the final model, but the lambda must be specified during prediction.
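If you would rather not hard-code these values, they can also be read straight from the train object (a minimal sketch using the fit2 object from above; bestTune is the caret slot holding the winning hyperparameters):
# one-row data frame with the chosen alpha and lambda
fit2$bestTune

# the lambda to pass to predict.glmnet through its s argument
best_lambda <- fit2$bestTune$lambda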
But first we must preprocess the test data (same as in the linked answer):
predict(fit2$preProcess, df)
This, however, returns a data frame that still contains the class column, so in order to supply it to predict.glmnet the response column (a factor) must be removed and the data frame converted to a matrix:
as.matrix(predict(fit2$preProcess, df)[,-21])
Now we can call predict.glmnet with the optimal lambda of 0.25, setting the prediction type to "class":
library(glmnet)
prediction2 <- predict(fit2$finalModel,
newx = as.matrix(predict(fit2$preProcess,
df)[,-21]),
type = "class",
s = 0.25)
head(prediction2)
1
[1,] "0"
[2,] "1"
[3,] "0"
[4,] "0"
[5,] "0"
[6,] "0"
EDIT: To answer the edited question about the confusion matrix differences.
When you call confusionMatrix on the output of train, the resulting matrix is obtained from the out-of-fold predictions made during resampling. It is less biased because these are test set predictions.
When you fit a model on all of the data (this is fit2$finalModel) and use it to predict on that same data, you create train set predictions, which carry a lot of bias because the model was fit using these very observations. This is why the off-diagonal sum is much smaller in this case compared to calling confusionMatrix on fit2. This is sometimes referred to as overfitting: the model predicts the data it has already seen much better.
In short,
confusionMatrix(fit2)
produces a confusion matrix from the out-of-fold predictions. This can be used as a metric for model selection,
while
confusionMatrix(as.factor(prediction2), g_f)
produces a highly biased confusion matrix based on the model's predictions on the train data. This should not be used as a metric for model selection.
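To see this concretely, here is a minimal sketch that rebuilds an out-of-fold confusion matrix from fit2$pred (available because savePredictions = T was set in trainControl), keeping only the winning hyperparameter combination; note that confusionMatrix(fit2) normalizes its cells, so it may report percentages rather than raw counts:
# out-of-fold predictions for the best alpha/lambda only
oof <- subset(fit2$pred,
              alpha == fit2$bestTune$alpha &
              lambda == fit2$bestTune$lambda)

# confusion matrix built from held-out predictions,
# the same information summarized by confusionMatrix(fit2)
confusionMatrix(oof$pred, oof$obs)$table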
EDIT2: It just occurred to me that this might be an XY problem.
If you just want the cross-validated predictions, you can simply use:
fit2$pred
If you want to calculate the AUC for these, you should specify that you want class probabilities in trainControl:
train_control <- trainControl(
method="cv",
number = 3,
savePredictions = TRUE,
classProbs = TRUE)
One additional concern is that the class levels need to be valid variable names, so numbers such as 0 and 1 won't work. An easy fix is:
df$g_f <- factor(df$g_f,
levels = c(0, 1),
labels = c("zero", "one"))
After the fit:
set.seed(3)
fit2 <- train(g_f ~ .,
data = df,
trControl = train_control,
tuneGrid = search_grid,
preProc = c("BoxCox", "center", "scale"),
method = "glmnet")
the predictions are in fit2$pred:
head(fit2$pred)
#output
pred obs rowIndex zero one alpha lambda Resample
1 one one 2 0.4513397 0.5486603 0 1 Fold1
2 zero zero 4 0.5764889 0.4235111 0 1 Fold1
3 zero one 5 0.5154925 0.4845075 0 1 Fold1
4 one one 6 0.4836418 0.5163582 0 1 Fold1
5 zero zero 7 0.5199623 0.4800377 0 1 Fold1
6 one zero 8 0.4770536 0.5229464 0 1 Fold1
These predictions are for all tested hyperparameter combinations. To get just the ones for the best performing hyperparameters:
library(tidyverse)
fit2$pred %>%
filter(alpha == fit2$bestTune$alpha &
       lambda == fit2$bestTune$lambda) -> best_preds
There are two approaches to obtain a metric from these predictions.
Approach 1: calculate the metric on the combined fold predictions (less common, but useful when you have small data sets and therefore high variance in the per-fold performance):
pROC::roc(best_preds$obs, best_preds$one)$auc
#output
Area under the curve: 0.6631
Approach 2: calculate the metric per fold and average (much more common, and used by caret internally for any metric):
library(tidyverse)
best_preds %>%
group_by(Resample) %>%
summarise(auc = as.numeric(pROC::roc(obs, one)$auc))
#output
Resample auc
<chr> <dbl>
1 Fold1 0.592
2 Fold2 0.757
3 Fold3 0.614
The above is the AUC per fold. To average it:
best_preds %>%
group_by(Resample) %>%
summarise(auc = as.numeric(pROC::roc(obs, one)$auc)) %>%
ungroup() %>%
summarise(mean_auc = mean(auc))
#output
mean_auc
<dbl>
1 0.654