random-forest decision-tree

When using the default 'randomForest' algorithm for classification, why doesn't the number of terminal nodes match the number of cases?


According to https://cran.r-project.org/web/packages/randomForest/randomForest.pdf, classification trees are fully grown by default, meaning nodesize = 1. However, if trees are really grown to the maximum, shouldn't each terminal node contain a single case (data point, species, etc.)? If I run:

library(randomForest)
data(iris)  # 150 cases
set.seed(352)
rf <- randomForest(Species ~ ., data = iris)
# treesize() counts terminal nodes per tree by default
hist(treesize(rf), main = "Number of terminal nodes per tree")

I can see that most "fully grown" trees have only about 10 terminal nodes, meaning node size can't be equal to 1... Right?

For example, a status of -1 below marks a terminal node in the 134th tree of the forest. Only 8 terminal nodes!?

> getTree(rf,134)
   left daughter right daughter split var split point status prediction
1              2              3         3        2.50      1          0
2              0              0         0        0.00     -1          1
3              4              5         4        1.75      1          0
4              6              7         3        4.95      1          0
5              8              9         3        4.85      1          0
6             10             11         4        1.60      1          0
7             12             13         1        6.50      1          0
8             14             15         1        5.95      1          0
9              0              0         0        0.00     -1          3
10             0              0         0        0.00     -1          2
11             0              0         0        0.00     -1          3
12             0              0         0        0.00     -1          3
13             0              0         0        0.00     -1          2
14             0              0         0        0.00     -1          2
15             0              0         0        0.00     -1          3

I would be grateful if someone could explain.


Solution

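  • "Fully grown" means "nothing left to split". A node of a decision tree is fully grown when all data records assigned to it belong to the same class, so they all yield the same prediction. The nodesize setting is only a lower bound on the size of terminal nodes, not a target: a node that is already pure becomes terminal no matter how many records it holds.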

    In the iris case, once you reach a node with 50 setosa data records in it, it doesn't make sense to split it into two child nodes with 25 setosas each: both children would predict setosa anyway.
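To make the "nothing left to split" point concrete, here is a minimal sketch in base R. It applies the root split of tree 134 from the question (variable 3, Petal.Length, at 2.50) to the full iris data and measures how mixed each side is. The `gini` helper is a hypothetical illustration written for this example, not a function from the randomForest package, and the real tree was grown on a bootstrap sample, so its node counts would differ from the full-data counts used here.

```r
data(iris)

# Hypothetical helper (not part of randomForest):
# Gini impurity of a vector of class labels; 0 means the node is pure.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Root split of tree 134 in the question: Petal.Length at 2.50.
# Cases with a value <= the split point go to the left daughter.
left  <- iris$Species[iris$Petal.Length <= 2.5]
right <- iris$Species[iris$Petal.Length >  2.5]

table(left)   # class counts in the left node: 50 setosa, nothing else
gini(left)    # 0: pure, so there is nothing left to split
gini(right)   # 0.5: versicolor and virginica are still mixed, so keep splitting
```

The left node holds 50 cases yet is already terminal, which is why a tree can be "fully grown" with far fewer terminal nodes than cases.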