What's the rule for selecting the complexity parameter (cp) and the method argument in the rpart() function from the rpart package? I've read a couple of articles about the package, but the content was too technical for me to grok.
Example:
rpart_1 <- rpart(myFormula, data = kyphosis,
method = "class",
control = rpart.control(minsplit = 0, cp = 0))
plotcp(rpart_1)
printcp(rpart_1)
You typically don't choose the method parameter as such; it's chosen for you as part of the problem you're solving. If it's a classification problem, you use method="class"; if it's a regression problem, you use method="anova"; and so on. Naturally, this means you have to understand what the problem is you're trying to solve, and whether your data will let you solve it.
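As a minimal sketch of that point, the snippet below fits one classification tree and one regression tree on the kyphosis data that ships with rpart. The formulas and the object names fit_class and fit_reg are illustrative choices, not anything from the question; the only thing that differs between the two calls is the type of the response and the matching method.

```r
library(rpart)

# Factor response (Kyphosis: absent/present) -> classification tree
fit_class <- rpart(Kyphosis ~ Age + Number + Start,
                   data = kyphosis, method = "class")

# Numeric response (Age, in months) -> regression tree
# (illustrative only; predicting Age is not a meaningful analysis)
fit_reg <- rpart(Age ~ Number + Start,
                 data = kyphosis, method = "anova")

# The fitted object records which method was used
fit_class$method
fit_reg$method
```

If method is left out, rpart() makes a guess from the response type (a factor gives "class", a numeric vector gives "anova"), but stating it explicitly documents the intent.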
The cp parameter controls the size of the fitted tree. You choose its value via cross-validation or using a separate test dataset. rpart is somewhat different to most other R modelling packages in how it handles this. The rpart function does cross-validation automatically by default. You then examine the model to see the result of the cross-validation, and prune the model based on that.
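The number of cross-validation folds is set through the xval argument of rpart.control(), which defaults to 10. A small sketch, assuming the kyphosis data and an arbitrary choice of 5 folds and cp = 0.001 purely for illustration:

```r
library(rpart)

set.seed(123)  # assumption: fixed seed so the random CV folds are repeatable

fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class",
             control = rpart.control(xval = 5, cp = 0.001))

# The cross-validation results are stored in the CP table of the fitted
# object: xerror is the cross-validated error, xstd its standard error
colnames(fit$cptable)
fit$cptable
```

Because the folds are assigned at random, xerror and xstd change from run to run unless a seed is set; the CP and nsplit columns do not.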
Worked example, using the MASS::Boston dataset:
library(rpart)
library(MASS)
# does 10-fold CV by default
Bos.tree <- rpart(medv ~ ., data=Boston, cp=0)
# look at the result of the CV
plotcp(Bos.tree)
The plot shows that the 10-fold cross-validated error flattens out beginning at a tree size of about 9 leaf nodes. The dotted line is the minimum of the curve plus 1 standard error, which is a standard rule of thumb for pruning decision trees: you pick the smallest tree size that is within 1 SE of the minimum.
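The 1-SE rule can also be applied directly to the CP table instead of being read off the plot. The following is a rough sketch, not part of rpart itself: the variable names (cpt, best, threshold, pick) are made up here, and the seed is only there to make the cross-validation folds repeatable.

```r
library(rpart)
library(MASS)

set.seed(1)  # assumption: any fixed seed, for repeatable CV folds
Bos.tree <- rpart(medv ~ ., data = Boston, cp = 0)

cpt <- Bos.tree$cptable

# Row with the smallest cross-validated error
best <- which.min(cpt[, "xerror"])

# Minimum error plus one standard error (the dotted line in plotcp)
threshold <- cpt[best, "xerror"] + cpt[best, "xstd"]

# Rows are ordered from smallest to largest tree, so the first row
# under the threshold is the smallest tree within 1 SE of the minimum
pick <- min(which(cpt[, "xerror"] <= threshold))
cpt[pick, ]
```

The selected row depends on the random folds, so a different seed can give a somewhat different tree size.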
Printing the CP values gives a more precise view of how to choose the tree size:
printcp(Bos.tree)
#           CP nsplit rel error  xerror     xstd
#1  0.45274420      0   1.00000 1.00355 0.082973
#2  0.17117244      1   0.54726 0.61743 0.057053
#3  0.07165784      2   0.37608 0.43034 0.046596
#4  0.03616428      3   0.30443 0.34251 0.042502
#5  0.03336923      4   0.26826 0.32642 0.040456
#6  0.02661300      5   0.23489 0.32591 0.040940
#7  0.01585116      6   0.20828 0.29324 0.040908
#8  0.00824545      7   0.19243 0.28256 0.039576
#9  0.00726539      8   0.18418 0.27334 0.037122
#10 0.00693109      9   0.17692 0.27593 0.037326
#11 0.00612633     10   0.16999 0.27467 0.037310
#12 0.00480532     11   0.16386 0.26547 0.036897
# . . .
This shows that a CP value of 0.00612 corresponds to a tree with 10 splits (and hence 11 leaves). This is the value of cp you use to prune the tree. So:
# prune with a value of cp slightly greater than 0.00612633
Bos.tree.cv <- prune(Bos.tree, cp=0.00613)
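One way to check that pruning did what was intended is to count the leaves of the pruned tree and compare with nsplit + 1 from the CP table. The sketch below is an illustration under stated assumptions: the helper n_leaves() is hypothetical (not an rpart function), and row 4 of the table is an arbitrary choice to keep the example short.

```r
library(rpart)
library(MASS)

set.seed(42)  # assumption: fixed seed; it only affects xerror/xstd
full <- rpart(medv ~ ., data = Boston, cp = 0)

# Arbitrary illustrative choice: row 4 of the CP table,
# with cp nudged slightly above the tabulated value
cp_row <- 4
cp_val <- full$cptable[cp_row, "CP"] * 1.0001
pruned <- prune(full, cp = cp_val)

# Hypothetical helper: leaves are the frame rows marked "<leaf>"
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

c(full = n_leaves(full), pruned = n_leaves(pruned))

# A binary tree with nsplit splits has nsplit + 1 leaves
full$cptable[cp_row, "nsplit"] + 1
```

The pruned object can then be used like any other rpart fit, for example with predict(pruned, newdata = Boston).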