
How to set a ppv in caret for random forest in r?


So I'm interested in creating a model that optimizes PPV. I've created an RF model (below) that outputs a confusion matrix, from which I then manually calculate sensitivity, specificity, PPV, NPV, and F1. I know accuracy is being optimized right now, but I'm willing to trade away sensitivity and specificity to get a much higher PPV.

data_ctrl_null <- trainControl(method="cv", number = 5, classProbs = TRUE, summaryFunction=twoClassSummary, savePredictions=T, sampling=NULL)

set.seed(5368)

model_htn_df <- train(outcome ~ ., data=htn_df, ntree = 1000, tuneGrid = data.frame(mtry = 38), trControl = data_ctrl_null, method= "rf", 
                           preProc=c("center","scale"),metric="ROC", importance=TRUE)

model_htn_df$finalModel #provides confusion matrix

Results:

Call:
  randomForest(x = x, y = y, ntree = 1000, mtry = param$mtry, importance = TRUE) 
           Type of random forest: classification
                 Number of trees: 1000
  No. of variables tried at each split: 38

    OOB estimate of  error rate: 16.2%
    Confusion matrix:
      no yes class.error
 no  274  19  0.06484642
 yes  45  57  0.44117647

My manual calculation: sens = 55.9%, spec = 93.5%, PPV = 75.0%, NPV = 85.9%. (The confusion matrix lists "no" before "yes" as outcomes, so I also switch the numbers accordingly when I calculate the performance metrics.)
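For what it's worth, those numbers can be reproduced directly from the confusion matrix counts; in randomForest's confusion matrix the rows are the actual classes and the columns are the predictions, with "yes" taken as the positive class:

```r
# Counts from the confusion matrix above ("yes" = positive class)
TN <- 274  # actual no,  predicted no
FP <- 19   # actual no,  predicted yes
FN <- 45   # actual yes, predicted no
TP <- 57   # actual yes, predicted yes

sens <- TP / (TP + FN)  # 57/102  ~ 0.559
spec <- TN / (TN + FP)  # 274/293 ~ 0.935
ppv  <- TP / (TP + FP)  # 57/76   = 0.750
npv  <- TN / (TN + FN)  # 274/319 ~ 0.859
```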

So what do I need to do to get a PPV = 90%?

This is a similar question, but I'm not really following it.


Solution

  • We define a function to calculate PPV and return the results with a name:

    # Custom summaryFunction for caret: PPV for the first class level
    # (posPredValue comes from caret)
    PPV <- function(data, lev = NULL, model = NULL) {
      value <- posPredValue(data$pred, data$obs, positive = lev[1])
      c(PPV = value)
    }
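To see what this summary function computes, here is a small base-R sanity check (toy vectors, not from the original answer); `posPredValue(pred, obs, positive = lev[1])` is simply TP / (TP + FP) for the first factor level:

```r
# Toy predictions with "yes" as the first (positive) level
obs  <- factor(c("yes", "yes", "no", "no"), levels = c("yes", "no"))
pred <- factor(c("yes", "no", "yes", "no"), levels = c("yes", "no"))

tab <- table(pred, obs)                       # rows = predicted, cols = actual
ppv <- tab["yes", "yes"] / sum(tab["yes", ])  # TP / (TP + FP) = 1/2
```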
    

    Let's say we have the following data:

    library(randomForest)
    library(caret)
    data=iris
    data$Species = ifelse(data$Species == "versicolor","versi","others")
    trn = sample(nrow(iris),100)
    

    Then we train by specifying PPV to be the metric:

    mdl <- train(Species ~ ., data = data[trn,],
                 method = "rf",
                 metric = "PPV",
                 trControl = trainControl(summaryFunction = PPV, 
                                          classProbs = TRUE))
    
    Random Forest 
    
    100 samples
      4 predictor
      2 classes: 'others', 'versi' 
    
    No pre-processing
    Resampling: Bootstrapped (25 reps) 
    Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 
    Resampling results across tuning parameters:
    
      mtry  PPV      
      2     0.9682811
      3     0.9681759
      4     0.9648426
    
    PPV was used to select the optimal model using the largest value.
    The final value used for the model was mtry = 2.
    

    Now you can see it is trained on PPV. However, you cannot force the training to achieve a PPV of 0.9: it really depends on the data. If your independent variables have no predictive power, the PPV will not improve however much you train it, right?
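One practical lever, beyond what the answer above covers (this is a sketch, not part of the original answer), is to predict the positive class only when its probability clears a cutoff higher than the default 0.5. You trade sensitivity for precision, and whether PPV = 0.9 is reachable still depends on the data:

```r
library(randomForest)
library(caret)

# Same two-class setup as the answer above
data <- iris
data$Species <- factor(ifelse(data$Species == "versicolor", "versi", "others"))
set.seed(5368)
trn <- sample(nrow(data), 100)

fit  <- randomForest(Species ~ ., data = data[trn, ])
prob <- predict(fit, data[-trn, ], type = "prob")[, "versi"]
obs  <- data$Species[-trn]

# Raise the cutoff and watch PPV (usually) climb while sensitivity falls
for (cut in c(0.5, 0.7, 0.9)) {
  pred <- factor(ifelse(prob >= cut, "versi", "others"),
                 levels = levels(obs))
  cat(sprintf("cutoff %.1f: PPV = %.3f, sens = %.3f\n", cut,
              posPredValue(pred, obs, positive = "versi"),
              sensitivity(pred, obs, positive = "versi")))
}
```

Note that at a very high cutoff the model may predict no positives at all, in which case PPV is undefined (NaN).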