Additional metrics in caret - PPV, sensitivity, specificity


I used caret for logistic regression in R:

  ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10, 
                       savePredictions = TRUE)

  mod_fit <- train(Y ~ .,  data=df, method="glm", family="binomial",
                   trControl = ctrl)

  print(mod_fit)

The default metrics printed are accuracy and Cohen's kappa. I want to extract additional metrics such as sensitivity, specificity, and positive predictive value, but I cannot find an easy way to do it. The final model is provided, but it is trained on all the data (as far as I can tell from the documentation), so I cannot use it to generate new held-out predictions.

confusionMatrix calculates all the required metrics, but passing it as a summary function doesn't work:

  ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10, 
                       savePredictions = TRUE, summaryFunction = confusionMatrix)

  mod_fit <- train(Y ~ .,  data=df, method="glm", family="binomial",
                   trControl = ctrl)

Error: `data` and `reference` should be factors with the same levels. 
13.
stop("`data` and `reference` should be factors with the same levels.", 
    call. = FALSE) 
12.
confusionMatrix.default(testOutput, lev, method) 
11.
ctrl$summaryFunction(testOutput, lev, method) 

Is there a way to extract this information in addition to accuracy and kappa, or to find it somewhere in the object returned by caret's train?

Thanks in advance!


Solution

  • caret already has summary functions that output all the metrics you mention:

    defaultSummary outputs Accuracy and Kappa
    twoClassSummary outputs AUC (area under the ROC curve - see the note at the end of this answer), sensitivity and specificity
    prSummary outputs precision (the same quantity as positive predictive value), recall, F and the area under the precision-recall curve

    To get all of these metrics at once, you can write your own summary function that combines the outputs of the three:

    library(caret)
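    # Custom summary function: concatenates the named metric vectors
    # returned by caret's three built-in summary functions.
    MySummary <- function(data, lev = NULL, model = NULL) {
      a1 <- defaultSummary(data, lev, model)   # Accuracy, Kappa
      b1 <- twoClassSummary(data, lev, model)  # ROC, Sens, Spec
      c1 <- prSummary(data, lev, model)        # AUC, Precision, Recall, F
      c(a1, b1, c1)
    }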
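
The error in the question comes from the calling convention: a summary function is called as f(data, lev, model), where data is a data frame with obs and pred columns, so confusionMatrix ends up receiving that data frame and lev as its two factors. As a rough sketch of a workaround, a small wrapper can unpack the columns and return a named numeric vector. The name cmSummary and the toy data frame below are hypothetical (not part of caret), and the wrapper assumes a two-class problem with the first level as the positive class.

```r
library(caret)

# Hypothetical helper (not part of caret): adapts confusionMatrix() to the
# summaryFunction interface. Assumes a two-class problem, where
# cm$byClass is a named numeric vector.
cmSummary <- function(data, lev = NULL, model = NULL) {
  if (is.null(lev)) lev <- levels(data$obs)
  cm <- confusionMatrix(data = data$pred, reference = data$obs,
                        positive = lev[1])
  c(Accuracy = cm$overall[["Accuracy"]],
    Kappa    = cm$overall[["Kappa"]],
    Sens     = cm$byClass[["Sensitivity"]],
    Spec     = cm$byClass[["Specificity"]],
    PPV      = cm$byClass[["Pos Pred Value"]],
    NPV      = cm$byClass[["Neg Pred Value"]])
}

# Made-up toy input in the shape a summary function receives.
toy <- data.frame(
  obs  = factor(c("M", "M", "M", "R", "R", "R"), levels = c("M", "R")),
  pred = factor(c("M", "M", "R", "M", "R", "R"), levels = c("M", "R"))
)
res <- cmSummary(toy, lev = c("M", "R"))
print(res)
```

A function like this could be passed as summaryFunction = cmSummary in trainControl. Since it only uses predicted classes, it should not need classProbs = TRUE.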
    

    Let's try MySummary on the Sonar data set:

    library(mlbench)
    data("Sonar")
    

    When defining the train control, it is important to set classProbs = TRUE, because some of these metrics (ROC and the precision-recall AUC) cannot be calculated from the predicted classes alone; they require the predicted class probabilities.

    ctrl <- trainControl(method = "repeatedcv",
                         number = 10,
                         savePredictions = TRUE,
                         summaryFunction = MySummary,
                         classProbs = TRUE)
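
To illustrate why class probabilities matter here, the following base-R sketch computes the area under the ROC curve as the fraction of (positive, negative) pairs that the probabilities rank correctly. The helper auc_from_probs and the probability vectors are made up for illustration and are not how caret computes the value internally. The two score vectors give identical class predictions at a 0.5 cutoff, yet rank the cases differently.

```r
# Hypothetical helper: AUC as the probability that a randomly chosen positive
# case gets a higher score than a randomly chosen negative one (ties count 0.5).
auc_from_probs <- function(prob_pos, is_pos) {
  pos <- prob_pos[is_pos]
  neg <- prob_pos[!is_pos]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

is_pos <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Two made-up sets of predicted probabilities for the positive class.
p1 <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.1)
p2 <- c(0.9, 0.8, 0.1, 0.6, 0.3, 0.2)

# Same hard class predictions at a 0.5 cutoff ...
same_classes <- identical(p1 >= 0.5, p2 >= 0.5)

# ... but different ranking quality, which only the probabilities reveal.
auc1 <- auc_from_probs(p1, is_pos)
auc2 <- auc_from_probs(p2, is_pos)
print(c(same_classes = same_classes, auc1 = auc1, auc2 = auc2))
```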
    

    Now fit the model of your choice:

    mod_fit <- train(Class ~.,
                     data = Sonar,
                     method = "rf",
                     trControl = ctrl)
    
    mod_fit$results
    #output
      mtry  Accuracy     Kappa       ROC      Sens      Spec       AUC Precision    Recall         F AccuracySD   KappaSD
    1    2 0.8364069 0.6666364 0.9454798 0.9280303 0.7333333 0.8683726 0.8121087 0.9280303 0.8621526 0.10570484 0.2162077
    2   31 0.8179870 0.6307880 0.9208081 0.8840909 0.7411111 0.8450612 0.8074942 0.8840909 0.8374326 0.06076222 0.1221844
    3   60 0.8034632 0.6017979 0.9049242 0.8659091 0.7311111 0.8332068 0.7966889 0.8659091 0.8229330 0.06795824 0.1369086
           ROCSD     SensSD    SpecSD      AUCSD PrecisionSD   RecallSD        FSD
    1 0.04393947 0.05727927 0.1948585 0.03410854  0.12717667 0.05727927 0.08482963
    2 0.04995650 0.11053858 0.1398657 0.04694993  0.09075782 0.11053858 0.05772388
    3 0.04965178 0.12047598 0.1387580 0.04820979  0.08951728 0.12047598 0.06715206
    

    In this output, ROC is in fact the area under the ROC curve (usually called AUC), while AUC is the area under the precision-recall curve across all cutoffs.
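
Since the question also asks whether the information can be found in the object returned by train: when predictions are saved, the held-out predictions are available in the pred element of the fit, and confusionMatrix can be applied to them after training. The sketch below assumes savePredictions = "final" (keeping only the predictions for the selected tuning parameters), 5-fold CV and an arbitrary seed, all chosen just to keep the example small. Note that it pools the predictions across folds, which is not the same as averaging per-fold metrics as in mod_fit$results.

```r
library(caret)
library(mlbench)
data("Sonar")

set.seed(1)  # arbitrary seed, only to make this sketch reproducible

# Assumption: plain 5-fold CV; "final" keeps held-out predictions
# only for the best tuning parameter values.
ctrl_pred <- trainControl(method = "cv", number = 5,
                          savePredictions = "final")

fit <- train(Class ~ ., data = Sonar, method = "rf", trControl = ctrl_pred)

# Pooled confusion matrix over all held-out predictions.
cm <- confusionMatrix(data = fit$pred$pred, reference = fit$pred$obs)
cm$byClass[c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value")]
```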