This post says that caret's rpart is more accurate than plain rpart because of bootstrapping and cross-validation:
Why do results using caret::train(..., method = "rpart") differ from rpart::rpart(...)?
However, when I compare both methods, I get an accuracy of 0.4879 for caret's rpart and 0.7347 for rpart (my code is copied below).
In addition, the classification tree from caret's rpart has only a few nodes (splits) compared to the one from rpart.
Does anyone understand these differences?
Thank you!
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Loading libraries and the data
This is an R Markdown document. First we load the libraries and the data, then split the training data into a training set and a test set.
```{r section1, echo=TRUE}
# load libraries
library(knitr)
library(caret)
suppressMessages(library(rattle))
library(rpart.plot)
# set the URL for the download
wwwTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wwwTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# download the datasets
training <- read.csv(url(wwwTrain))
testing <- read.csv(url(wwwTest))
# set seed for reproducibility (must come before the random partition)
set.seed(12345)
# create a partition with the training dataset (5% of rows go to TrainSet)
inTrain <- createDataPartition(training$classe, p=0.05, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet <- training[-inTrain, ]
dim(TrainSet)
```
## Cleaning the data
```{r section2, echo=TRUE}
# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet <- TestSet[, -NZV]
dim(TrainSet)
dim(TestSet)
# remove variables that are mostly NA
AllNA <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, AllNA==FALSE]
TestSet <- TestSet[, AllNA==FALSE]
dim(TrainSet)
dim(TestSet)
# remove identification only variables (columns 1 to 5)
TrainSet <- TrainSet[, -(1:5)]
TestSet <- TestSet[, -(1:5)]
dim(TrainSet)
```
## Prediction modelling
First we build a classification model using caret with the rpart method:
```{r section4, echo=TRUE}
mod_rpart <- train(classe ~ ., method = "rpart", data = TrainSet)
pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
mod_rpart$finalModel
fancyRpartPlot(mod_rpart$finalModel)
```
Second we build a similar model using rpart:
```{r section7, echo=TRUE}
# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)
# prediction on Test dataset
predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
confMatDecTree
```
A simple explanation is that you did not tune either model, and at the default settings rpart happened to perform better.
When you use the same parameters, you should expect the same performance.
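To see which settings each call used, it can help to print the complexity parameter (`cp`) that caret selected next to rpart's default. The sketch below uses the built-in `iris` data as a stand-in for `TrainSet` (an assumption made only so the example is self-contained). With no `tuneLength` or `tuneGrid`, `train()` evaluates three candidate `cp` values.

```r
library(caret)
library(rpart)

set.seed(1)
# iris stands in for TrainSet here; it is not the data from the question
fit_default <- train(Species ~ ., data = iris, method = "rpart")

# candidate cp values caret evaluated, with resampled Accuracy and Kappa
print(fit_default$results)

# cp that caret picked for the final model
caret_cp <- fit_default$bestTune$cp
print(caret_cp)

# cp that rpart::rpart() uses when no control argument is given
rpart_cp <- rpart.control()$cp
print(rpart_cp)
```

If the two values differ, the trees are pruned to different sizes, so different accuracies and different numbers of splits are to be expected.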
Let's do some tuning with caret:
    set.seed(1)
    mod_rpart <- train(classe ~ .,
                       method = "rpart",
                       data = TrainSet,
                       tuneLength = 50,
                       metric = "Accuracy",
                       trControl = trainControl(method = "repeatedcv",
                                                number = 4,
                                                repeats = 5,
                                                summaryFunction = multiClassSummary,
                                                classProbs = TRUE))
    pred_rpart <- predict(mod_rpart, TestSet)
    confusionMatrix(pred_rpart, TestSet$classe)
    # output
    Confusion Matrix and Statistics

              Reference
    Prediction    A    B    C    D    E
             A 4359  243   92  135   38
             B  446 2489  299  161  276
             C  118  346 2477  300   92
             D  190  377  128 2240  368
             E  188  152  254  219 2652

    Overall Statistics

                   Accuracy : 0.7628
                     95% CI : (0.7566, 0.7688)
        No Information Rate : 0.2844
        P-Value [Acc > NIR] : < 2.2e-16

                      Kappa : 0.7009
     Mcnemar's Test P-Value : < 2.2e-16

    Statistics by Class:

                         Class: A Class: B Class: C Class: D Class: E
    Sensitivity            0.8223   0.6900   0.7622   0.7332   0.7741
    Specificity            0.9619   0.9214   0.9444   0.9318   0.9466
    Pos Pred Value         0.8956   0.6780   0.7432   0.6782   0.7654
    Neg Pred Value         0.9316   0.9253   0.9495   0.9469   0.9490
    Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
    Detection Rate         0.2339   0.1335   0.1329   0.1202   0.1423
    Detection Prevalence   0.2611   0.1970   0.1788   0.1772   0.1859
    Balanced Accuracy      0.8921   0.8057   0.8533   0.8325   0.8603
That is a bit better than rpart with default settings (`cp = 0.01`), which reached 0.7347 in the question.
How about if we set the optimal `cp` chosen by caret:
    modFitDecTree <- rpart(classe ~ .,
                           data = TrainSet,
                           method = "class",
                           # bestTune is a one-row data frame, so extract the cp column
                           control = rpart.control(cp = mod_rpart$bestTune$cp))
    predictDecTree <- predict(modFitDecTree, newdata = TestSet, type = "class")
    confusionMatrix(predictDecTree, TestSet$classe)
    # part of output
    Accuracy : 0.7628
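To check that the two calls build the same tree once `cp` matches, one option is to compare their predictions directly rather than only the accuracy. A minimal sketch, again using `iris` and a hypothetical 70/30 split as stand-ins for the question's `TrainSet` and `TestSet` (the names `train_df`, `test_df`, `fit_caret` and `fit_rpart` are illustrative, not from the question):

```r
library(caret)
library(rpart)

set.seed(1)
# hypothetical 70/30 split of iris, standing in for TrainSet / TestSet
idx      <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

# tune cp with caret using 5-fold cross-validation
fit_caret <- train(Species ~ ., data = train_df, method = "rpart",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv", number = 5))

# refit with rpart directly, passing the cp value caret selected
fit_rpart <- rpart(Species ~ ., data = train_df, method = "class",
                   control = rpart.control(cp = fit_caret$bestTune$cp))

pred_caret <- predict(fit_caret, newdata = test_df)
pred_rpart <- predict(fit_rpart, newdata = test_df, type = "class")

# share of test rows on which the two models agree
agreement <- mean(as.character(pred_caret) == as.character(pred_rpart))
print(agreement)
```

An agreement of 1 would mean the predictions are identical. A lower value would suggest that some setting other than `cp` differs between the two fits.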