AIM: I want to understand why RMSE increases on a smaller tree.
CONTEXT: I am learning the rpart algorithm. I split my data into three sections (training, validation, testing). I am using this Kaggle dataset.
I fit the model:
library(rpart)

# Fit a regression tree ("anova") on all predictors
homes_model <- rpart(formula = SalePrice ~ .,
                     data = homes_train,
                     method = "anova")
With this base tree:
Then, I calculated the RMSE on the test data:
pred_base <- predict(object = homes_model,
                     newdata = homes_test)

library(Metrics)
rmse_base <- rmse(actual = homes_test$SalePrice,  # Actual values
                  predicted = pred_base)          # Predicted values
The rmse_base of this first tree is 46894.
Then I looked at the cptable to pick the best tree according to the lowest xerror + xstd rule.
      CP nsplit rel error xerror  xstd
1  0.446      0      1.00   1.00 0.096
2  0.114      1      0.55   0.56 0.054
3  0.078      2      0.44   0.48 0.055
4  0.035      3      0.36   0.41 0.037
5  0.021      4      0.33   0.40 0.046
6  0.018      5      0.31   0.41 0.047
7  0.017      6      0.29   0.39 0.045
8  0.017      7      0.27   0.39 0.045
9  0.013      8      0.25   0.37 0.043
10 0.010      9      0.24   0.35 0.043
I chose row 7 of the table (the tree with 6 splits):
opt_index <- 7
cp_opt <- homes_model$cptable[opt_index, "CP"]

# Prune the model (to optimized cp value)
homes_model_opt <- prune(tree = homes_model,
                         cp = cp_opt)
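The row index can also be derived from the cptable instead of being hard-coded. The sketch below shows one way to do that, assuming the xerror + xstd rule is meant as the usual one-standard-error rule: take the smallest tree whose xerror is no larger than the minimum xerror plus that row's xstd. It runs on simulated stand-in data (the `sim` data frame and its columns are hypothetical), not on the Kaggle data.

```r
library(rpart)

# Simulated stand-in data (hypothetical; not the Kaggle housing data)
set.seed(42)
n <- 400
sim <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
sim$y <- 10 * (sim$x1 > 0.5) + 5 * sim$x2 + rnorm(n)

fit <- rpart(y ~ ., data = sim, method = "anova")
cpt <- fit$cptable

# One-standard-error rule (assumed reading of "lowest xerror + xstd"):
# smallest tree whose xerror is within one xstd of the minimum xerror
min_row   <- which.min(cpt[, "xerror"])
threshold <- cpt[min_row, "xerror"] + cpt[min_row, "xstd"]
opt_index <- min(which(cpt[, "xerror"] <= threshold))

fit_1se <- prune(fit, cp = cpt[opt_index, "CP"])

# The row index is not the number of splits: row 1 has nsplit = 0
cpt[opt_index, c("CP", "nsplit", "xerror", "xstd")]
```

Because xerror comes from random cross-validation folds, the selected row can change from one fit to the next unless a seed is set before calling `rpart`.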
I plotted it:
Then I calculated the RMSE again for this smaller tree on the testing data:
# Compute predicted values
pred_opt <- predict(object = homes_model_opt,
                    newdata = homes_test)

# Compute RMSE
rmse_opt <- rmse(actual = homes_test$SalePrice,  # Actual values
                 predicted = pred_opt)           # Predicted values
It went up from 46894 to 49964. WHY? Shouldn't a smaller tree fit the unseen data better?
There is always a balance between a tree big enough to represent the variation in the data and not so big that it overfits. Bigger trees sometimes produce better results because they partition the data more finely and so capture nuances. Smaller trees sometimes produce better results because they overfit less. But if the smallest tree were always the best, then why not just use one node? Using only the root node would estimate every value with the overall average, which is unlikely to be accurate. The two conflicting forces must be balanced to get the best result.
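The trade-off can be seen directly by scoring every pruning level of one large tree on held-out data. This is a minimal sketch on simulated stand-in data (the `sim` data frame, its columns, and the `cp = 0.001` setting are illustrative assumptions, not the Kaggle setup), with `rmse` written out by hand so no extra package is needed.

```r
library(rpart)

# Simulated stand-in data (hypothetical; not the Kaggle housing data)
set.seed(1)
n <- 600
sim <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
sim$y <- 10 * sin(2 * pi * sim$x1) + 5 * sim$x2 + rnorm(n, sd = 2)

train_idx <- sample(n, 400)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Grow a deliberately large tree, then score each pruning level on held-out data
full <- rpart(y ~ ., data = train, method = "anova",
              control = rpart.control(cp = 0.001))

cps <- full$cptable[, "CP"]
test_rmse <- sapply(cps, function(cp) {
  pruned <- prune(full, cp = cp)
  rmse(test$y, predict(pruned, newdata = test))
})

# One row per candidate tree: number of splits and held-out RMSE
data.frame(nsplit = full$cptable[, "nsplit"],
           test_rmse = round(test_rmse, 2))
```

The first row is the root-only tree, which predicts the training mean for every case. Where the minimum falls depends on the data and on the particular split, so a pruned tree scoring slightly worse than the full tree on one test set is not unusual. When a validation set is available, running this comparison there keeps the test set untouched for the final estimate.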