Tags: r, machine-learning, cross-validation, r-caret, glm

How do I obtain the coefficients, z scores, and p-values for each fold of a k-fold cross validation in R?


I'm performing 5-fold cross-validation using glm for logistic regression. Here is a reproducible example using the built-in mtcars dataset:

library(caret)

data("mtcars")
str(mtcars)


mtcars$vs<-as.factor(mtcars$vs)
df0<-na.omit(mtcars)

set.seed(123) 
train.control <- trainControl(method = "cv", number = 5)
# Train the model
model <- train(vs ~., data = mtcars, method = "glm",
               trControl = train.control)


print(model)

summary(model)

model$resample

confusionMatrix(model)

pred.mod  <- predict(model)
confusionMatrix(data=pred.mod, reference=mtcars$vs)

Output


> print(model)

Generalized Linear Model 

32 samples
10 predictors
 2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 25, 26, 25, 27, 25 
Resampling results:

  Accuracy   Kappa    
  0.9095238  0.8164638

> summary(model)

Call:
NULL

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-1.181e-05  -2.110e-08  -2.110e-08   2.110e-08   1.181e-05  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  8.117e+01  1.589e+07       0        1
mpg          2.451e+00  5.979e+04       0        1
cyl         -3.908e+01  2.947e+05       0        1
disp        -1.927e-02  8.518e+03       0        1
hp           3.129e-01  2.283e+04       0        1
drat        -2.735e+01  9.696e+05       0        1
wt          -1.248e+01  6.437e+05       0        1
qsec         1.565e+01  3.845e+05       0        1
am          -4.562e+01  3.632e+05       0        1
gear        -2.835e+01  5.448e+05       0        1
carb         1.788e+01  2.971e+05       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4.3860e+01  on 31  degrees of freedom
Residual deviance: 7.2154e-10  on 21  degrees of freedom
AIC: 22

Number of Fisher Scoring iterations: 25

> model$resample
   Accuracy     Kappa Resample
1 0.8571429 0.6956522    Fold1
2 0.8333333 0.6666667    Fold2
3 0.8571429 0.7200000    Fold3
4 1.0000000 1.0000000    Fold4
5 1.0000000 1.0000000    Fold5


> confusionMatrix(model)
Cross-Validated (5 fold) Confusion Matrix 

(entries are percentual average cell counts across resamples)
 
          Reference
Prediction    0    1
         0 50.0  3.1
         1  6.2 40.6
                            
 Accuracy (average) : 0.9062


> pred.mod  <- predict(model)
> confusionMatrix(data=pred.mod, reference=mtcars$vs)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 18  0
         1  0 14
                                     
               Accuracy : 1          
                 95% CI : (0.8911, 1)
    No Information Rate : 0.5625     
    P-Value [Acc > NIR] : 1.009e-08  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.5625     
         Detection Rate : 0.5625     
   Detection Prevalence : 0.5625     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     
                          

This all works fine, but I'd like to obtain the summary(model) information for each fold (meaning the coefficients, p-values, z scores, etc. that you get when calling summary()), as well as the sensitivity and specificity for each fold if possible. Can someone help?


Solution

  • This is an interesting question. The values you are looking for cannot be obtained directly from the model object, but they can be recalculated if you know which observations of the training data belong to which fold. This information can be extracted from the model if you specify savePredictions = "all" in the trainControl function. With the predictions for each fold you can do something like this:

    # first of all, save all predictions from all folds
    set.seed(123) 
    train.control <- trainControl(method = "cv", number = 5,
                                  savePredictions = "all")
    # Train the model
    model <- train(vs ~ ., data = mtcars, method = "glm",
                   trControl = train.control)
    
    # now we can extract the statistics you are looking for
    fold <- unique(model$pred$Resample)
    mystat <- function(model, x) {
      pred <- model$pred
      df <- pred[pred$Resample == x, ]        # held-out observations of fold x
      cm <- confusionMatrix(df$pred, df$obs)  # sensitivity, specificity, etc.
      control <- trainControl(method = "none")
      newdat <- mtcars[-df$rowIndex, ]        # training observations of fold x
      fit <- train(vs ~ ., data = newdat, method = "glm",
                   trControl = control)       # refit glm on this fold's training set
      summ <- summary(fit)
      z_p <- summ$coefficients[, 3:4]         # z values and p-values
      return(list(cm, z_p))
    }
    stat <- lapply(fold, mystat, model = model)
    names(stat) <- fold
    

    Note that specifying method = "none" in trainControl forces train to fit the model to the entire supplied training set without any resampling or parameter tuning. In this form it is not a beautiful function, but it does what you want, and you can always adapt it to make it more general.
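
    Once the stat list is built, the per-fold quantities can be pulled out by indexing. A small sketch, assuming the list structure returned by mystat above (element 1 is the confusionMatrix object, element 2 is the matrix of z values and p-values):

    ```r
    # per-fold sensitivity and specificity, one column per fold
    sens_spec <- sapply(stat, function(s) s[[1]]$byClass[c("Sensitivity", "Specificity")])
    sens_spec

    # z scores and p-values of the model refitted on Fold1's training data
    stat$Fold1[[2]]
    ```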