Tags: r, random-forest, r-caret, predict, precision-recall

How to input a caret trained random forest model into predict() and performance() functions?


I want to create a precision-recall curve using performance() but I don't know how to input my data. I followed this example:

library(ROCR)
attach(ROCR.simple)
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
perf <- performance(pred,"prec","rec")
plot(perf)

I am trying to mimic that for my caret-trained RF model, specifically on the training data (I know there are various examples of how to use predict on newdata). I tried this:

pred <- prediction(rf_train_model$pred$case, rf_train_model$pred$pred)
perf <- performance(pred,"prec","rec")
plot(perf)

My model is below. I tried the above because it seems to match the structure of the ROCR.simple data.

#create model
ctrl <- trainControl(method = "cv",
                     number = 5,
                     savePredictions = TRUE,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)
set.seed(3949)
rf_train_model <- train(outcome ~ ., data=df_train, 
                  method= "rf",
                  ntree = 1500, 
                  tuneGrid = data.frame(mtry = 33), 
                  trControl = ctrl, 
                  preProc=c("center","scale"), 
                  metric="ROC",
                  importance=TRUE)

> head(rf_train_model$pred)
     pred     obs      case   control rowIndex mtry Resample
1 control control 0.3173333 0.6826667        4   33    Fold1
2 control control 0.3666667 0.6333333        7   33    Fold1
3 control control 0.2653333 0.7346667       16   33    Fold1
4 control control 0.1606667 0.8393333       18   33    Fold1
5 control control 0.2840000 0.7160000       20   33    Fold1
6    case    case 0.6206667 0.3793333       25   33    Fold1

This is wrong because my precision-recall curve goes the wrong way. I am interested in more than just the PR-AUC (the source I followed shows how to compute that), so I would like to fix this error. What error am I making?


Solution

  • If you read the documentation of prediction:

    it has to be declared which class label denotes the negative, and which the positive class. Ideally, labels should be supplied as ordered factor(s), the lower level corresponding to the negative class, the upper level to the positive class. If the labels are factors (unordered), numeric, logical or characters, ordering of the labels is inferred from R's built-in < relation (e.g. 0 < 1, -1 < 1, 'a' < 'b', FALSE < TRUE).
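    To see how this ordering rule determines the positive class, here is a minimal sketch with toy scores (not the model from the question); the identifiers are illustrative only:

    ```r
    library(ROCR)
    scores <- c(0.9, 0.8, 0.3, 0.2)
    labels <- factor(c("case", "case", "control", "control"))
    # Default alphabetical levels are c("case", "control"), so "control"
    # (the upper level) would be treated as the positive class --
    # the opposite of what we want here.
    levels(labels)
    # An ordered factor with "case" as the upper level fixes this:
    labels_ord <- factor(labels, levels = c("control", "case"), ordered = TRUE)
    pred <- prediction(scores, labels_ord)
    perf <- performance(pred, "prec", "rec")
    ```

    With the un-ordered factor the curve is computed for "control"; reordering the levels (or converting to TRUE/FALSE, as below) makes "case" the positive class.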

    In your case, when you provide rf_train_model$pred$pred, the upper level is still "control", so the positive class is inferred incorrectly; the simplest fix is to convert the labels to TRUE / FALSE. Also, you should provide the actual labels, not the predicted labels: use rf_train_model$pred$obs. See below for an example:

    library(caret)
    library(ROCR)
    set.seed(100)
    df = data.frame(matrix(runif(100*100), ncol = 100))
    df$outcome = ifelse(runif(100) > 0.5, "case", "control")
    
    df_train = df[1:80,]
    df_test = df[81:100,]
    
    # same trainControl as in the question
    ctrl <- trainControl(method = "cv",
                         number = 5,
                         savePredictions = TRUE,
                         summaryFunction = twoClassSummary,
                         classProbs = TRUE)
    
    rf_train_model <- train(outcome ~ ., data=df_train, 
                      method= "rf",
                      ntree = 1500, 
                      tuneGrid = data.frame(mtry = 33), 
                      trControl = ctrl, 
                      preProc=c("center","scale"), 
                      metric="ROC",
                      importance=TRUE)
    
    levels(rf_train_model$pred$pred)
    [1] "case"    "control"
    
    plotCurve = function(label, positive_class, prob){
      # ROCR infers the positive class from the label ordering, so
      # convert to logical: TRUE (the upper level) is the positive class
      pred = prediction(prob, label == positive_class)
      perf = performance(pred, "prec", "rec")
      plot(perf)
    }
    
    plotCurve(rf_train_model$pred$obs, "case", rf_train_model$pred$case)
    plotCurve(df_test$outcome, "case",
              predict(rf_train_model, df_test, type = "prob")[, "case"])
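    Since the question also mentions the PR-AUC: newer versions of ROCR (1.0-11 and later) ship an "aucpr" measure, so, assuming such a version is installed, the area under the precision-recall curve can be read off the same kind of prediction object. A sketch using the ROCR.simple data from the question:

    ```r
    library(ROCR)
    # ROCR.simple ships with ROCR: numeric scores plus 0/1 labels,
    # where 1 (the upper value) is the positive class
    pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
    aucpr <- performance(pred, "aucpr")
    aucpr@y.values[[1]]
    ```

    The same call works on the prediction object built from rf_train_model$pred$case and rf_train_model$pred$obs == "case".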