
Why is rpart more accurate than Caret rpart in R


This post mentions that caret's rpart is more accurate than plain rpart because of the bootstrapping and cross-validation it performs:

Why do results using caret::train(..., method = "rpart") differ from rpart::rpart(...)?

However, when I compare both methods, I get an accuracy of 0.4879 for caret's rpart and 0.7347 for rpart (my code is copied below).

On top of that, the classification tree from caret's rpart has only a few nodes (splits) compared to the rpart tree.

Does anyone understand these differences?

Thank you!

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Loading libraries and the data

This is an R Markdown document. First we load the libraries and the data, and split the training data into a training set and a test set.

```{r section1, echo=TRUE}

# load libraries
library(knitr)
library(caret)
suppressMessages(library(rattle))
library(rpart.plot)

# set the URL for the download
wwwTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wwwTest  <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# download the datasets
training <- read.csv(url(wwwTrain))
testing  <- read.csv(url(wwwTest))

# set the seed before the random partition so the split is reproducible
set.seed(12345)

# create a partition with the training dataset
inTrain  <- createDataPartition(training$classe, p=0.05, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet  <- training[-inTrain, ]
dim(TrainSet)

```
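
createDataPartition samples within each level of `classe`, so the 5% training split should keep roughly the same class proportions as the full data. A quick check (this extra chunk is only illustrative):

```{r section1b, echo=TRUE}
# the stratified split should (approximately) preserve class proportions
round(prop.table(table(training$classe)), 3)
round(prop.table(table(TrainSet$classe)), 3)
```
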
## Cleaning the data

```{r section2, echo=TRUE}
# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet  <- TestSet[, -NZV]
dim(TrainSet)
dim(TestSet)

# remove variables that are mostly NA
AllNA    <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, AllNA==FALSE]
TestSet  <- TestSet[, AllNA==FALSE]
dim(TrainSet)
dim(TestSet)

# remove identification only variables (columns 1 to 5)
TrainSet <- TrainSet[, -(1:5)]
TestSet  <- TestSet[, -(1:5)]
dim(TrainSet)


```
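As a quick sanity check, the cleaned sets should now contain no near-zero-variance columns and no NAs; a short illustrative chunk:

```{r section3, echo=TRUE}
# verify the cleaning: expect 0 for both checks
length(nearZeroVar(TrainSet))         # remaining near-zero-variance columns
sum(colMeans(is.na(TrainSet)) > 0)    # remaining columns with any NA
```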

## Prediction modelling

First we build a classification model using caret with the rpart method:
```{r section4, echo=TRUE}

mod_rpart <- train(classe ~ ., method = "rpart", data = TrainSet)

pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)

mod_rpart$finalModel
fancyRpartPlot(mod_rpart$finalModel)

```
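
To understand why this tree is so small, we can inspect what train() did by default: caret resamples with the bootstrap and tunes rpart's complexity parameter cp over a short grid, and a large selected cp prunes the tree hard. The fields below are standard parts of a fitted train object:

```{r section5, echo=TRUE}
# how caret tuned the model by default
mod_rpart$control$method  # resampling method used (bootstrap by default)
mod_rpart$results         # resampled accuracy for each candidate cp
mod_rpart$bestTune        # the cp value used to fit finalModel
```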

Second we build a similar model using rpart:
```{r section7, echo=TRUE}

# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)

# prediction on Test dataset
predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
confMatDecTree

```
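
The difference in tree size can be made explicit by counting the terminal nodes of the two fits; a small sketch:

```{r section8, echo=TRUE}
# number of leaves in each tree
sum(mod_rpart$finalModel$frame$var == "<leaf>")  # caret's heavily pruned tree
sum(modFitDecTree$frame$var == "<leaf>")         # rpart's default (cp = 0.01) tree
```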

## Solution

  • A simple explanation is that you did not tune either model, and at the default settings rpart performed better by pure chance.

    When you use the same parameters, you should expect the same performance.
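
    A quick way to verify this (a sketch using the default-trained model from above; fit_same is just an illustrative name): refit plain rpart with the cp that caret's default run selected, and the resulting tree and accuracy should closely match caret's:

    fit_same <- rpart(classe ~ ., data = TrainSet, method = "class",
                      control = rpart.control(cp = mod_rpart$bestTune$cp))
    confusionMatrix(predict(fit_same, TestSet, type = "class"), TestSet$classe)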

    Let's do some tuning with caret:

    set.seed(1)
    mod_rpart <- train(classe ~ .,
                       method = "rpart",
                       data = TrainSet,
                       tuneLength = 50, 
                       metric = "Accuracy",
                       trControl = trainControl(method = "repeatedcv",
                                                number = 4,
                                                repeats = 5,
                                                summaryFunction = multiClassSummary,
                                                classProbs = TRUE))
    
    pred_rpart <- predict(mod_rpart, TestSet)
    confusionMatrix(pred_rpart, TestSet$classe)
    # output
    Confusion Matrix and Statistics
    
              Reference
    Prediction    A    B    C    D    E
             A 4359  243   92  135   38
             B  446 2489  299  161  276
             C  118  346 2477  300   92
             D  190  377  128 2240  368
             E  188  152  254  219 2652
    
    Overall Statistics
    
                   Accuracy : 0.7628          
                     95% CI : (0.7566, 0.7688)
        No Information Rate : 0.2844          
        P-Value [Acc > NIR] : < 2.2e-16       
    
                      Kappa : 0.7009          
     Mcnemar's Test P-Value : < 2.2e-16       
    
    Statistics by Class:
    
                         Class: A Class: B Class: C Class: D Class: E
    Sensitivity            0.8223   0.6900   0.7622   0.7332   0.7741
    Specificity            0.9619   0.9214   0.9444   0.9318   0.9466
    Pos Pred Value         0.8956   0.6780   0.7432   0.6782   0.7654
    Neg Pred Value         0.9316   0.9253   0.9495   0.9469   0.9490
    Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
    Detection Rate         0.2339   0.1335   0.1329   0.1202   0.1423
    Detection Prevalence   0.2611   0.1970   0.1788   0.1772   0.1859
    Balanced Accuracy      0.8921   0.8057   0.8533   0.8325   0.8603
    

    That is a bit better than rpart with the default settings (cp = 0.01).

    How about if we set the optimal cp as chosen by caret:

    modFitDecTree <- rpart(classe ~ .,
                           data = TrainSet,
                           method = "class",
                           control = rpart.control(cp = mod_rpart$bestTune$cp))
    
    predictDecTree <- predict(modFitDecTree, newdata = TestSet, type = "class" )
    confusionMatrix(predictDecTree, TestSet$classe)
    # part of the output
    Accuracy : 0.7628
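
    To see how the resampled accuracy varied across the 50 candidate cp values, you can plot the tuning profile:

    plot(mod_rpart)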