Tags: r, machine-learning, regression, cart, rpart

Why do RMSE values increase on a smaller tree (RPART)


AIM: I want to understand why RMSE increases on a smaller tree.

CONTEXT: I am learning the rpart algorithm. I had some data, which I split into three sections (training, validation, testing). I am using this Kaggle dataset.
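For reference, a minimal sketch of how such a three-way split might be done; the object names (homes, homes_valid) and the 70/15/15 proportions are assumptions, not necessarily what was used above:

# Hypothetical 70/15/15 split; 'homes' is the full data frame read from the
# Kaggle CSV (object names and proportions are assumptions)
set.seed(42)
n       <- nrow(homes)
idx     <- sample(seq_len(n))
n_train <- floor(0.70 * n)
n_valid <- floor(0.15 * n)

homes_train <- homes[idx[1:n_train], ]
homes_valid <- homes[idx[(n_train + 1):(n_train + n_valid)], ]
homes_test  <- homes[idx[(n_train + n_valid + 1):n], ]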

I fit the model:

homes_model <- rpart(formula = SalePrice ~ ., 
                     data = homes_train, 
                     method = "anova")   # "anova" = regression tree (continuous response)

With this base tree:

[plot of the base (unpruned) tree]
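The plot itself is not reproduced here; as a sketch, a tree like this can be drawn with the rpart.plot package, which may or may not be what produced the original figure:

library(rpart.plot)      # assumed plotting package, not stated in the question
rpart.plot(homes_model)  # draws the unpruned regression tree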

Then, I calculated the RMSE on the test data:

pred_base <- predict(object = homes_model,
                     newdata = homes_test)

library(Metrics)
rmse_base <- rmse(actual = homes_test$SalePrice,  # actual values
                  predicted = pred_base)          # predicted values

The rmse_base of this first tree is: 46894.
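For reference, Metrics::rmse is simply the square root of the mean squared error, so the same number can be reproduced by hand:

# Equivalent manual RMSE: square root of the mean squared prediction error
rmse_manual <- sqrt(mean((homes_test$SalePrice - pred_base)^2))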

Then, I looked at the cptable to pick the best tree according to the lowest xerror + xstd rule (a programmatic version of this rule is sketched after the table).

    CP nsplit rel error xerror  xstd
1  0.446      0      1.00   1.00 0.096
2  0.114      1      0.55   0.56 0.054
3  0.078      2      0.44   0.48 0.055
4  0.035      3      0.36   0.41 0.037
5  0.021      4      0.33   0.40 0.046
6  0.018      5      0.31   0.41 0.047
7  0.017      6      0.29   0.39 0.045
8  0.017      7      0.27   0.39 0.045
9  0.013      8      0.25   0.37 0.043
10 0.010      9      0.24   0.35 0.043
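As a sketch, that selection rule can also be applied to the cptable programmatically. This is one common reading of the lowest xerror + xstd (1-SE) rule, not necessarily the exact code used for the choice below:

cp_tab <- homes_model$cptable

# 1-SE rule: take the row with the minimum xerror, add its xstd,
# then pick the smallest tree whose xerror falls under that threshold
best_row  <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[best_row, "xerror"] + cp_tab[best_row, "xstd"]
opt_index <- min(which(cp_tab[, "xerror"] <= threshold))

cp_tab[opt_index, ]   # with the table above this picks row 7 (nsplit = 6)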

Following that rule, I chose row 7 of the cptable (6 splits):

opt_index <- 7                                   # row of the cptable chosen above
cp_opt <- homes_model$cptable[opt_index, "CP"]

# Prune the model (to the optimized cp value)
homes_model_opt <- prune(tree = homes_model, 
                         cp = cp_opt)

I plotted it:

[plot of the pruned tree]

Then I calculated the RMSE again on this smaller tree on testing data:

# Compute predicted values on the test data
pred_opt <- predict(object = homes_model_opt,
                    newdata = homes_test)

# Compute RMSE
rmse_opt <- rmse(actual = homes_test$SalePrice,  # actual values
                 predicted = pred_opt)           # predicted values

It went up from 46894 to 49964. WHY? Shouldn't a smaller tree fit the unseen data better?


Solution

  • There is always a balance to strike: the tree must be big enough to represent the variation in the data, but not so big that it overfits. The reason that bigger trees sometimes produce better results is that they partition the data more finely and so capture nuances. The reason that smaller trees sometimes produce better results is that they are less prone to overfitting. But if the smallest tree were always the best, then why not just use one node? Using only the root node would estimate every value with the overall average, which is not likely to be very accurate. These two conflicting forces must be balanced to get the best result, as the sketch below illustrates.
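One way to see this balance empirically is to prune the fitted tree at every cp value in its cptable and compare the resulting RMSE on held-out data. A minimal sketch, assuming the validation set mentioned in the question is called homes_valid (an assumed name; any held-out data works):

library(rpart)
library(Metrics)

# Prune at each cp value from the cptable and score on held-out data
cp_values <- homes_model$cptable[, "CP"]
rmse_by_cp <- sapply(cp_values, function(cp) {
  pruned <- prune(homes_model, cp = cp)
  rmse(actual    = homes_valid$SalePrice,
       predicted = predict(pruned, newdata = homes_valid))
})

# RMSE usually drops as splits are added, flattens, and can rise again
# once the extra splits only fit noise in the training data
data.frame(cp     = cp_values,
           nsplit = homes_model$cptable[, "nsplit"],
           rmse   = rmse_by_cp)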