Tags: r, decision-tree, threshold, confusion-matrix

Set a threshold for the probability result from a decision tree


I tried to calculate the confusion matrix after fitting the decision tree model:

library(rpart)
library(caret)

# tree model
tree <- rpart(LoanStatus_B ~ ., data = train, method = 'class')

# confusion matrix
pdata <- predict(tree, newdata = test, type = "class")
confusionMatrix(data = pdata, reference = test$LoanStatus_B, positive = "1")

How can I set a probability threshold for this confusion matrix? For example, I might want to classify any observation with a predicted probability above 0.2 as a default, which is the positive level of my binary outcome.


Solution

  • Several things to note here. First, make sure you're getting class probabilities when you make your predictions. With type = "class" you were getting only discrete class labels, so what you wanted would have been impossible. You'll want type = "prob" instead, as in the example below.

    library(rpart)
    data(iris)

    # binary outcome: 1 = setosa, 0 = anything else
    iris$Y <- ifelse(iris$Species == "setosa", 1, 0)

    # tree model
    tree <- rpart(Y ~ Sepal.Width, data = iris, method = 'class')

    # predicted class probabilities (one column per class: "0" and "1")
    pdata <- as.data.frame(predict(tree, newdata = iris, type = "prob"))
    head(pdata)

    # confusion matrix at a 0.5 cutoff
    table(iris$Y, pdata$`1` > .5)
    

    Next, note that the .5 here is just an arbitrary value; you can change it to whatever you want.
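
    For instance, a quick way to see how the cutoff changes the table is to loop over a few candidate values (the cutoffs 0.2, 0.5, and 0.8 below are purely illustrative):

    # compare confusion matrices at a few illustrative cutoffs
    for (cutoff in c(0.2, 0.5, 0.8)) {
      cat("\ncutoff =", cutoff, "\n")
      print(table(actual = iris$Y, predicted = pdata$`1` > cutoff))
    }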

    I don't see a reason to use the confusionMatrix function when a confusion matrix can be created this simply; the table approach also lets you achieve your goal of easily changing the cutoff.

    That said, if you do want to use the confusionMatrix function, just create a discrete class prediction first based on your custom cutoff, like this:

    pdata$my_custom_predicted_class <- ifelse(pdata$`1` > .5, 1, 0)
    

    Where, again, .5 is your custom chosen cutoff and can be anything you want it to be. Note that recent versions of caret require both data and reference to be factors with the same levels, hence the factor() calls below:

    caret::confusionMatrix(data = factor(pdata$my_custom_predicted_class, levels = c(0, 1)),
                           reference = factor(iris$Y, levels = c(0, 1)),
                           positive = "1")
    
    Confusion Matrix and Statistics
    
              Reference
    Prediction  0  1
             0 94 19
             1  6 31
    
                   Accuracy : 0.8333          
                     95% CI : (0.7639, 0.8891)
        No Information Rate : 0.6667          
        P-Value [Acc > NIR] : 3.661e-06       
    
                      Kappa : 0.5989          
     Mcnemar's Test P-Value : 0.0164          
    
                Sensitivity : 0.6200          
                Specificity : 0.9400          
             Pos Pred Value : 0.8378          
             Neg Pred Value : 0.8319          
                 Prevalence : 0.3333          
             Detection Rate : 0.2067          
       Detection Prevalence : 0.2467          
          Balanced Accuracy : 0.7800          
    
           'Positive' Class : 1
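
    Applied back to your original loan model, the same pattern would look something like this. This is a sketch reusing the tree, test, and LoanStatus_B objects from your question, with your 0.2 cutoff; it assumes LoanStatus_B is coded 0/1 with 1 meaning default, as your positive = "1" suggests:

    library(rpart)
    library(caret)

    # class probabilities instead of discrete classes
    pdata <- as.data.frame(predict(tree, newdata = test, type = "prob"))

    # classify as default (1) when P(default) > 0.2
    pred_class <- factor(ifelse(pdata$`1` > 0.2, 1, 0), levels = c(0, 1))

    confusionMatrix(data = pred_class,
                    reference = factor(test$LoanStatus_B, levels = c(0, 1)),
                    positive = "1")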