I'm performing 5-fold cross-validation using glm to perform logistic regression. Here is a reproducible example using the built-in cars dataset
library(caret)
data("mtcars")
str(mtcars)
mtcars$vs<-as.factor(mtcars$vs)
df0<-na.omit(mtcars)
set.seed(123)
train.control <- trainControl(method = "cv", number = 5)
# Train the model
model <- train(vs ~., data = mtcars, method = "glm",
trControl = train.control)
print(model)
summary(model)
model$resample
confusionMatrix(model)
pred.mod <- predict(model)
confusionMatrix(data=pred.mod, reference=mtcars$vs)
Output
> print(model)
Generalized Linear Model
32 samples
10 predictors
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 25, 26, 25, 27, 25
Resampling results:
Accuracy Kappa
0.9095238 0.8164638
> summary(model)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-1.181e-05 -2.110e-08 -2.110e-08 2.110e-08 1.181e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.117e+01 1.589e+07 0 1
mpg 2.451e+00 5.979e+04 0 1
cyl -3.908e+01 2.947e+05 0 1
disp -1.927e-02 8.518e+03 0 1
hp 3.129e-01 2.283e+04 0 1
drat -2.735e+01 9.696e+05 0 1
wt -1.248e+01 6.437e+05 0 1
qsec 1.565e+01 3.845e+05 0 1
am -4.562e+01 3.632e+05 0 1
gear -2.835e+01 5.448e+05 0 1
carb 1.788e+01 2.971e+05 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3860e+01 on 31 degrees of freedom
Residual deviance: 7.2154e-10 on 21 degrees of freedom
AIC: 22
Number of Fisher Scoring iterations: 25
> model$resample
Accuracy Kappa Resample
1 0.8571429 0.6956522 Fold1
2 0.8333333 0.6666667 Fold2
3 0.8571429 0.7200000 Fold3
4 1.0000000 1.0000000 Fold4
5 1.0000000 1.0000000 Fold5
> confusionMatrix(model)
Cross-Validated (5 fold) Confusion Matrix
(entries are percentual average cell counts across resamples)
Reference
Prediction 0 1
0 50.0 3.1
1 6.2 40.6
Accuracy (average) : 0.9062
> pred.mod <- predict(model)
> confusionMatrix(data=pred.mod, reference=mtcars$vs)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 18 0
1 0 14
Accuracy : 1
95% CI : (0.8911, 1)
No Information Rate : 0.5625
P-Value [Acc > NIR] : 1.009e-08
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.5625
Detection Rate : 0.5625
Detection Prevalence : 0.5625
Balanced Accuracy : 1.0000
'Positive' Class : 0
This all works fine, but I'd like to obtain the summary(model) information for each fold(meaning the coefficients, p valuess, z scores, etc that you get when performing summary() ), as well as the sensitivity and specificity for each fold if possible. Can someone help?
Is an interesting question. The values you are looking for cannot be obtained directly from the model
object, but can be recalculated by knowing which observations of training data are part of which fold. This info can be extracted from the model
if you specify savePredictions = "all"
in the trainControl
function. With the prediction for each k fold you can do something like this:
#first of all, save all predictions from all folds
set.seed(123)
train.control <- trainControl(method = "cv", number = 5,savePredictions =
"all")
# Train the model
model <- train(vs ~., data = mtcars, method = "glm",
trControl = train.control)
#now we can extract the statistics you are looking for
fold <- unique(pred$Resample)
mystat <- function(model,x){
pred <- model$pred
df <- pred[pred$Resample==x,]
cm <- confusionMatrix(df$pred,df$obs)
control <- trainControl(method = "none")
newdat <- mtcars[pred$rowIndex,]
fit <- train(vs~.,data=newdat,trControl=control)
summ <- summary(model)
z_p <- summ$coefficients[,3:4]
return(list(cm,z_p))
}
stat <- lapply(fold, mystat,model=model)
names(stat) <- fold
Note that by specifying method="none"
in trainControl
force train
fit the model to the entire training set without any resampling or parameter tuning.
in this form, it is not a beautiful function, but it does what you want and you can always adapt it to make it more general.