Search code examples
rmachine-learningr-caretgbm

Caret classification thresholds


I have been using a gbm in the caret package in Rstudioto find the probability for the occurrence of a failure.

I have used Youden's J to find a threshold for the best classification, which is 0.63. How do I now use this threshold? I presume the best way to do this is to somehow incorporated the threshold into the gbm model in caret to get more accurate predictions, and then rerun the model on the training data again? Currently it defaults to 0.5 and I can't find an obvious way to update the threshold.

Alternatively, is the threshold just used to separate the test data predictions into the correct class? This seems more straight forward, but how then do I reflect the change in the ROC_AUC plot, assuming the probability should be updated based on the new threshold?

Any help would be gratefully received. Thanks

EDIT: The full code I am working on is as follows:

  
library(datasets)
library(caret)
library(MLeval)
library(dplyr)

data(iris)
data <- as.data.frame(iris)

# create class
data$class <- ifelse(data$Species == "setosa", "yes", "no")

# split into train and test
train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)


# Set up control function for training
ctrl <- trainControl(method = "cv",
                     number = 5, 
                     returnResamp = 'none',
                     summaryFunction = twoClassSummary,
                     classProbs = T,
                     savePredictions = T,
                     verboseIter = F)

# Set up trainng grid - this is based on a hyper-parameter tune that was recently done
gbmGrid <-  expand.grid(interaction.depth = 10,
                        n.trees = 20000,                                          
                        shrinkage = 0.01,                                         
                        n.minobsinnode = 4) 


# Build a standard classifier using a gradient boosted machine
set.seed(5627)
gbm_iris <- train(class ~ .,
                   data = train,
                   method = "gbm",
                   metric = "ROC",
                   tuneGrid = gbmGrid,
                   verbose = FALSE,
                   trControl = ctrl)

# Calcuate best thresholds
caret::thresholder(gbm_iris, threshold = seq(.01,0.99, by = 0.01), final = TRUE, statistics = "all")

pred <- predict(gbm_iris, newdata = test, type = "prob")
roc <- evalm(data.frame(pred, test$class))


Solution

  • There are several problems in your code. I will use the PimaIndiansDiabetes data set from mlbench since it is better suited then the iris data set.

    First of all for splitting data into train and test sets the code:

    train <- data %>% sample_frac(.70)
    test <- data %>% sample_frac(.30)
    

    is not suited since some rows occurring in the train set will also occur in the test set.

    Additionally avoid to use function names as object names, it will save you much headache in the long run.

    data(iris)
    data <- as.data.frame(iris) #bad object name
    

    To the example:

    library(caret)
    library(ModelMetrics)
    library(dplyr)
    library(mlbench)
    
    data(PimaIndiansDiabetes, package = "mlbench")
    

    Create train and test sets, you may use base R sample to sample rows or caret::createDataPartition. createDataPartition is preferable since it tries to preserve the distribution of the response.

    set.seed(123)
    ind <- createDataPartition(PimaIndiansDiabetes$diabetes, 0.7)
    
    
    tr <- PimaIndiansDiabetes[ind$Resample1,]
    ts <- PimaIndiansDiabetes[-ind$Resample1,]
    

    This way no rows in the train set will be in the test set.

    Lets create the model:

    ctrl <- trainControl(method = "cv",
                         number = 5, 
                         returnResamp = 'none',
                         summaryFunction = twoClassSummary,
                         classProbs = T,
                         savePredictions = T,
                         verboseIter = F)
    
    
    gbmGrid <-  expand.grid(interaction.depth = 10,
                            n.trees = 200,                                          
                            shrinkage = 0.01,                                         
                            n.minobsinnode = 4) 
    
    set.seed(5627)
    gbm_pima <- train(diabetes ~ .,
                      data = tr,
                      method = "gbm", #use xgboost
                      metric = "ROC",
                      tuneGrid = gbmGrid,
                      verbose = FALSE,
                      trControl = ctrl)
    

    create a vector of probabilities for thresholder

    probs <- seq(.1, 0.9, by = 0.02)
    
    ths <- thresholder(gbm_pima,
                       threshold = probs,
                       final = TRUE,
                       statistics = "all")
    
    head(ths)
    
    Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall        F1 Prevalence Detection Rate Detection Prevalence
    1     200                10      0.01              4           0.10       1.000  0.02222222      0.6562315      1.0000000 0.6562315  1.000 0.7924209  0.6510595      0.6510595            0.9922078
    2     200                10      0.01              4           0.12       1.000  0.05213675      0.6633439      1.0000000 0.6633439  1.000 0.7975413  0.6510595      0.6510595            0.9817840
    3     200                10      0.01              4           0.14       0.992  0.05954416      0.6633932      0.8666667 0.6633932  0.992 0.7949393  0.6510595      0.6458647            0.9739918
    4     200                10      0.01              4           0.16       0.984  0.07435897      0.6654277      0.7936508 0.6654277  0.984 0.7936383  0.6510595      0.6406699            0.9636022
    5     200                10      0.01              4           0.18       0.984  0.14188034      0.6821550      0.8750000 0.6821550  0.984 0.8053941  0.6510595      0.6406699            0.9401230
    6     200                10      0.01              4           0.20       0.980  0.17179487      0.6886786      0.8833333 0.6886786  0.980 0.8086204  0.6510595      0.6380725            0.9271018
      Balanced Accuracy  Accuracy      Kappa          J      Dist
    1         0.5111111 0.6588517 0.02833828 0.02222222 0.9777778
    2         0.5260684 0.6692755 0.06586592 0.05213675 0.9478632
    3         0.5257721 0.6666781 0.06435166 0.05154416 0.9406357
    4         0.5291795 0.6666781 0.07134190 0.05835897 0.9260250
    5         0.5629402 0.6901572 0.15350721 0.12588034 0.8585308
    6         0.5758974 0.6979836 0.18460584 0.15179487 0.8288729
    

    extract the threshold probability based on your preferred metric

    ths %>%
      mutate(prob = probs) %>%
      filter(J == max(J)) %>%
      pull(prob) -> thresh_prob
    
    thresh_prob
    0.74
    

    predict on test data

    pred <- predict(gbm_pima, newdata = ts, type = "prob")
    

    create a numeric response (0 or 1) based on the response in the test set since this is needed for the functions from package ModelMetrics

    real <- as.numeric(factor(ts$diabetes))-1
    
    ModelMetrics::sensitivity(real, pred$pos, cutoff = thresh_prob)
    0.2238806 #based on this it is clear the threshold chosen is not optimal on this test data
    
    ModelMetrics::specificity(real, pred$pos, cutoff = thresh_prob)
    0.956
    
    ModelMetrics::kappa(real, pred$pos, cutoff = thresh_prob)
    0.2144026  #based on this it is clear the threshold chosen is not optimal on this test data
    
    ModelMetrics::mcc(real, pred$pos, cutoff = thresh_prob)
    0.2776309  #based on this it is clear the threshold chosen is not optimal on this test data
    
    ModelMetrics::auc(real, pred$pos)
    0.8047463  #decent AUC and low mcc and kappa indicate a poor choice of threshold
    

    Auc is a measure over all thresholds so it does not require specification of the cutoff threshold.

    Since only one train/test split is used the performance evaluation will be biased. Best is to use nested resampling so the same can be evaluated over several train/test splits. Here is a way to performed nested resampling.

    EDIT: Answer to the questions in comments.

    To create the roc curve you do not need to calculate sensitivity and specificity on all thresholds you can just use a specified package for such a task. The results are probability going to be more trustworthy.
    I prefer using the pROC package:

    library(pROC)
    
    roc.obj <- roc(real, pred$pos)
    plot(roc.obj, print.thres = "best")
    

    enter image description here

    The best threshold on the figure is the threshold that gives the highest specificity + sensitivity on the test data. It is clear that this threshold (0.289) is much lower compared to the threshold obtained based on cross validated predictions (0.74). This is the reason I said there will be considerable optimistic bias if you adjust the threshold on the cross-validated predictions and use thus obtained performance as an indicator of threshold success.

    In the above example not tuning the threshold would have resulted in better performance on the test set. This might hold true in general for the Pima Indians data set or this might be a case of an unfortunate train/test split. So it is best to validate this sort of thing using nested resampling.