I fitted the following tree
library(rpart)
treeResult = rpart(Species~., data=iris[1:120,], method="class")
and am trying to calculate the yellow number below (0.285714) manually.
I thought this should be the relative Gini impurity that remains when the tree goes from 0 splits to 1 split:
pNode1 = c(50,50,20)/120
pNode2 = c(50,0,0)/50
pNode3 = c(0,50,20)/70
# The counts used to calculate these pNodes are taken from summary(treeResult).
impurityNode1 = sum(pNode1*(1-pNode1))
impurityNode2 = sum(pNode2*(1-pNode2))
impurityNode3 = sum(pNode3*(1-pNode3))
relativeError = (50/120*impurityNode2+70/120*impurityNode3) / impurityNode1
However, this yields 0.3809524 instead of 0.285714.
No, it is not the relative Gini impurity. The number displayed is the relative total impurity, where impurity is measured as the misclassification rate rather than the Gini index.
At the top-level node, the best you can do is predict one of the majority classes (50 points), which misclassifies the other 70, so the impurity is 70/120 = 0.58333. After the first split, one node perfectly classifies 50 points and the other node has a 50/20 split. So there are 20 errors out of 120 points, and the impurity at that level is 20/120 = 0.16667. The relative impurity being computed is
(20/120) / (70/120) = 0.16667/0.58333 = 0.285714
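As a rough sketch of that arithmetic, the same number can be reproduced in base R from the class counts in the question. The helper `nodeErrors` and the variable names are my own, not anything from rpart; the counts are the ones the question reads off summary(treeResult).

```r
# Class counts per node, as given in the question (from summary(treeResult))
rootCounts  <- c(setosa = 50, versicolor = 50, virginica = 20)
leftCounts  <- c(setosa = 50, versicolor = 0,  virginica = 0)
rightCounts <- c(setosa = 0,  versicolor = 50, virginica = 20)

# Hypothetical helper: errors in a node = points outside its majority class
nodeErrors <- function(counts) sum(counts) - max(counts)

rootErrors  <- nodeErrors(rootCounts)                            # 70
splitErrors <- nodeErrors(leftCounts) + nodeErrors(rightCounts)  # 0 + 20 = 20

# Error after one split, relative to the error of the root-only tree
relError <- splitErrors / rootErrors
print(relError)
```

Note that the 120 cancels out, so the ratio is simply 20/70.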
For completeness, after the second split there are 3 errors. Relative to the original 70 errors, that gives 3/70 = 0.042857, which is the number shown next to nsplit=2.
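If you want to check this against the fitted object rather than by hand, something like the following should work. It assumes the same fit as in the question and reads the `cptable` component of the rpart object; the variable names `cpTable`, `rootNodeError`, and `absError` are just for illustration.

```r
library(rpart)

# Same fit as in the question
treeResult <- rpart(Species ~ ., data = iris[1:120, ], method = "class")

# cptable has one row per tree size; "rel error" is scaled so the root-only tree is 1
cpTable <- treeResult$cptable
print(cpTable[, c("nsplit", "rel error")])

# Multiplying by the root node error (70/120) converts back to an absolute error rate
rootNodeError <- 70 / 120
absError <- cpTable[, "rel error"] * rootNodeError
print(absError)
```

The "rel error" entry for nsplit = 1 should be the 20/70 computed above, and multiplying it by the root node error gives back 20/120.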