Rpart vs. caret rpart "Error: There were missing values in resampled performance measures"


I am using the caret package and tried its rpart method. Interestingly, I can fit a model with the rpart package directly, but as soon as I go through caret, it no longer works. What puzzles me further is that I have seen rpart used within caret on various websites, e.g. for the Boston data.

I am confused whether I implemented the model incorrectly or whether I missed a point here. For rpart_tree2 (below) I get the following error message: "In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures."

I know that I could also specify e.g. repeatedcv, but that makes no difference regarding the error message.

Below you will find a MWE (I tried to keep it as simple as possible):

library(caret)
library(rpart)
library(MASS)  # provides the Boston data set

data("Boston")

index <- sample(nrow(Boston),nrow(Boston)*0.75)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]

rpart_tree1 <- rpart(medv ~ ., data = Boston.train)

rpart_tree2 <- train(medv ~., data = Boston.train, method = "rpart")

Solution

  • The warning is not a problem.

    With larger cp values, the tree fitted in some resamples has no splits. A tree with no splits predicts the mean of the training outcome for every observation. Because these predictions have zero variance, the cor function throws a warning and returns NA. caret uses this correlation to calculate Rsquared, so Rsquared is NA (that is, missing) for those resamples, which is exactly what the warning reports.
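
    As a rough sketch of that mechanism outside caret, the snippet below fits rpart directly on the full Boston data. The value cp = 0.9 and the name stump are assumptions chosen only for illustration; on this data such an extreme cp should leave a tree with the root node alone.

    ```r
    library(rpart)
    library(MASS)   # provides the Boston data set

    data(Boston)

    # cp = 0.9 is a deliberately extreme, illustrative value: a split is kept
    # only if it improves the overall fit by at least that fraction.
    stump <- rpart(medv ~ ., data = Boston, control = rpart.control(cp = 0.9))

    nrow(stump$frame)                # number of nodes in the fitted tree
    unique(predict(stump, Boston))   # distinct predicted values
    mean(Boston$medv)                # mean of the training outcome, for comparison
    ```

    If the tree has a single node, every prediction equals the mean of medv, which is the constant-prediction situation described above.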

    Example:

    library(caret)
    library(rpart)
    library(MASS)
    data(Boston)
    
    set.seed(1)
    index <- sample(nrow(Boston),nrow(Boston)*0.75)
    Boston.train <- Boston[index,]
    Boston.test <- Boston[-index,]
    

    Lower cp values do not produce the warning:

    rpart_tree2 <- train(medv ~., data = Boston.train, method = "rpart",
                         tuneGrid = data.frame(cp = c(0.01, 0.05, 0.1)))
    

    With a higher cp and a specific seed, the warning appears:

    set.seed(111)
    rpart_tree3 <- train(medv ~., data = Boston.train, method = "rpart",
                         tuneGrid = data.frame(cp = c(0.4)),
                         trControl = trainControl(savePredictions = TRUE))
    
    Warning message:
    In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
      There were missing values in resampled performance measures.
    

    To inspect the problem:

    rpart_tree3$resample
            RMSE  Rsquared      MAE   Resample
    1   7.530482 0.4361392 5.708437 Resample01
    2   7.334995 0.2350619 5.392867 Resample02
    3   7.178178 0.3971089 5.511530 Resample03
    4   6.369189 0.2798907 4.851146 Resample04
    5   7.550175 0.3344412 5.566677 Resample05
    6   7.019099 0.4270561 5.160572 Resample06
    7   7.197384 0.4530680 5.665177 Resample07
    8   7.206760 0.3447690 5.290300 Resample08
    9   7.408748 0.4553087 5.513998 Resample09
    10  7.241468 0.4119979 5.452725 Resample10
    11  7.562511 0.3967082 5.768643 Resample11
    12  7.347378 0.3861702 5.225532 Resample12
    13  7.124039 0.4039857 5.599800 Resample13
    14  7.151013 0.3301835 5.490676 Resample14
    15  6.518536 0.3835073 4.938662 Resample15
    16 10.008008        NA 7.174290 Resample16
    17  7.018742 0.4431380 5.379823 Resample17
    18  7.454669 0.3888220 6.000062 Resample18
    19  6.745457 0.3772237 5.175481 Resample19
    20  6.864304 0.4179276 5.089924 Resample20
    21  7.238874 0.2378432 5.234752 Resample21
    22  7.581736 0.3707839 5.543641 Resample22
    23  7.236317 0.3431725 5.278693 Resample23
    24  7.232241 0.4196955 5.518907 Resample24
    25  6.641846 0.3664023 4.683834 Resample25
    

    We can see that the problem occurred in Resample16, where Rsquared is NA:

    library(tidyverse)
    rpart_tree3$pred %>%
      filter(Resample == "Resample16") -> for_cor
    head(for_cor)
          pred  obs rowIndex  cp   Resample
    1 21.87018 15.6        1 0.4 Resample16
    2 21.87018 22.3        3 0.4 Resample16
    3 21.87018 13.4        6 0.4 Resample16
    4 21.87018 12.7       10 0.4 Resample16
    5 21.87018 18.6       11 0.4 Resample16
    6 21.87018 19.0       13 0.4 Resample16
    

    We can see that pred is the same for every row of Resample16, so the correlation cannot be computed:

    cor(for_cor$pred, for_cor$obs, use = "pairwise.complete.obs")
    [1] NA
    Warning message:
    In cor(for_cor$pred, for_cor$obs, use = "pairwise.complete.obs") :
      the standard deviation is zero
    

    To see how Rsquared is calculated in caret, check out the source for postResample. It is essentially cor(pred, obs)^2.
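
    The three regression metrics can also be approximated by hand in base R. This is only a sketch: metrics_by_hand is a hypothetical helper written for this answer, not caret's code, and varying_pred is made-up data. The obs values are the six rows printed by head(for_cor) above.

    ```r
    # Hypothetical helper for illustration; it mimics the idea, not caret's source.
    metrics_by_hand <- function(pred, obs) {
      c(RMSE     = sqrt(mean((pred - obs)^2)),
        # cor() warns and returns NA when pred has zero variance
        Rsquared = suppressWarnings(cor(pred, obs))^2,
        MAE      = mean(abs(pred - obs)))
    }

    obs <- c(15.6, 22.3, 13.4, 12.7, 18.6, 19.0)   # from head(for_cor)

    # Constant predictions, as produced by a tree with no splits
    constant_pred <- rep(21.87018, length(obs))
    metrics_by_hand(constant_pred, obs)

    # Made-up predictions that vary, for contrast
    varying_pred <- c(16, 21, 14, 13, 19, 18)
    metrics_by_hand(varying_pred, obs)
    ```

    With constant predictions only Rsquared is missing, while RMSE and MAE remain well defined. That matches the Resample16 row, which has values for RMSE and MAE but NA for Rsquared.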