
Key methods selection in rpart


What's the rule for selecting the complexity parameter (cp) and the method in the rpart() function from the rpart package? I've read a couple of articles about the package, but the content was too technical for me to grok.

Example:

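library(rpart)

# kyphosis ships with rpart; this formula is an assumed stand-in for myFormula
myFormula <- Kyphosis ~ Age + Number + Start
rpart_1 <- rpart(myFormula, data = kyphosis,
                 method = "class",
                 control = rpart.control(minsplit = 0, cp = 0))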
plotcp(rpart_1)
printcp(rpart_1)

Solution

  • You typically don't choose the method parameter as such; it's determined by the problem you're solving. For a classification problem, use method="class"; for a regression problem, use method="anova"; and so on. Naturally, this means you have to understand what problem you're trying to solve, and whether your data will let you solve it.
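
    As a minimal sketch of how the response type lines up with method (the mtcars regression is an illustration chosen here, not part of the original example):

    ```r
    library(rpart)

    # Factor response (Kyphosis: absent/present) -> classification tree
    class.tree <- rpart(Kyphosis ~ Age + Number + Start,
                        data = kyphosis, method = "class")

    # Numeric response (mpg) -> regression tree; mtcars is an assumed example dataset
    anova.tree <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")

    # The fitted object records which method was used
    class.tree$method
    anova.tree$method
    ```

    If method is omitted, rpart guesses it from the response (a factor gives "class", a plain numeric vector gives "anova"), but stating it explicitly makes the intent clear.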

    The cp parameter controls the size of the fitted tree. You choose its value via cross-validation or using a separate test dataset. rpart is somewhat different to most other R modelling packages in how it handles this. The rpart function does cross-validation automatically by default. You then examine the model to see the result of the cross-validation, and prune the model based on that.
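
    A short sketch of where the automatic cross-validation shows up, assuming the kyphosis data that ships with rpart. The seed value is arbitrary and only makes the fold assignment repeatable:

    ```r
    library(rpart)

    set.seed(1)  # CV folds are random; fixing the seed makes xerror repeatable

    # xval sets the number of cross-validation folds (10 is the default)
    fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 method = "class",
                 control = rpart.control(cp = 0, xval = 10))

    # The CV results are stored on the fitted object, one row per candidate tree size
    colnames(fit$cptable)
    fit$cptable
    ```

    The xerror and xstd columns hold the cross-validated error and its standard error; these are the values that plotcp() and printcp() display.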

    Worked example, using the MASS::Boston dataset:

    library(rpart)
    library(MASS)   # provides the Boston dataset
    
    # does 10-fold CV by default
    Bos.tree <- rpart(medv ~ ., data=Boston, cp=0)
    
    # look at the result of the CV
    plotcp(Bos.tree)
    

    [Plot: cross-validated relative error against cp and tree size, as drawn by plotcp(Bos.tree)]

    The plot shows that the 10-fold cross-validated error flattens out beginning at a tree size of about 9 leaf nodes. The dotted line is the minimum of the curve plus 1 standard error, which is a standard rule of thumb for pruning decision trees: you pick the smallest tree size that is within 1 SE of the minimum.
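
    The 1-SE rule can also be applied programmatically. This is a sketch that assumes the CV results are read from the cptable component of the fitted tree; the variable names and the seed are illustrative, and the selected row will vary with the random folds:

    ```r
    library(rpart)
    library(MASS)

    set.seed(42)  # arbitrary seed, only for repeatability
    tree.full <- rpart(medv ~ ., data = Boston, cp = 0)
    cpt <- tree.full$cptable

    # Row with the lowest cross-validated error
    i.min <- which.min(cpt[, "xerror"])

    # Threshold: minimum xerror plus one standard error (the dotted line in plotcp)
    threshold <- cpt[i.min, "xerror"] + cpt[i.min, "xstd"]

    # Rows run from smallest to largest tree, so the first
    # row under the threshold is the smallest qualifying tree
    i.1se <- min(which(cpt[, "xerror"] <= threshold))
    cpt[i.1se, ]
    ```

    The CP and nsplit entries of that row give the pruning value and the corresponding tree size.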

    Printing the CP values gives a more precise view of how to choose the tree size:

    printcp(Bos.tree)
    
    #           CP nsplit rel error  xerror     xstd
    #1  0.45274420      0   1.00000 1.00355 0.082973
    #2  0.17117244      1   0.54726 0.61743 0.057053
    #3  0.07165784      2   0.37608 0.43034 0.046596
    #4  0.03616428      3   0.30443 0.34251 0.042502
    #5  0.03336923      4   0.26826 0.32642 0.040456
    #6  0.02661300      5   0.23489 0.32591 0.040940
    #7  0.01585116      6   0.20828 0.29324 0.040908
    #8  0.00824545      7   0.19243 0.28256 0.039576
    #9  0.00726539      8   0.18418 0.27334 0.037122
    #10 0.00693109      9   0.17692 0.27593 0.037326
    #11 0.00612633     10   0.16999 0.27467 0.037310
    #12 0.00480532     11   0.16386 0.26547 0.036897
    # . . .
    

    Row 11 of the table shows that a CP value of 0.00612633 corresponds to a tree with 10 splits (and hence 11 leaves). This is the value of cp to use when pruning the tree:

    # prune with a value of cp slightly greater than 0.00612633
    Bos.tree.cv <- prune(Bos.tree, cp=0.00613)
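
    One way to check the result of pruning, sketched with illustrative object names (full.tree, pruned.tree) and a leaf count taken from the tree's frame component:

    ```r
    library(rpart)
    library(MASS)

    full.tree <- rpart(medv ~ ., data = Boston, cp = 0)
    pruned.tree <- prune(full.tree, cp = 0.00613)

    # Leaves are the rows of the frame whose split variable is "<leaf>"
    n.leaves <- sum(pruned.tree$frame$var == "<leaf>")
    n.leaves

    # The pruned tree's CP table stops at the chosen tree size
    printcp(pruned.tree)

    # The pruned tree predicts like any other rpart model
    head(predict(pruned.tree, newdata = Boston))
    ```

    Because every split of a binary tree adds exactly one leaf, the leaf count should equal the last nsplit value in the pruned CP table plus one.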