How does the box.col argument work for colouring prp tree graphs? I would like to colour terminal nodes using three colours, based either on three age categories or on three groupings of node numbers (in my actual data the two increase together, so colouring by outcome value or by node number would both work).
I've read the package documentation and vignette but still have no clue where to begin, even for just two groups. Below are two attempts to control two colours. The first appears to colour the boxes at random, and the second, which I expected to colour by the fitted node value, produces no colour at all.
library(rpart)
library(rpart.plot)
data(ptitanic)
tree <- rpart(age ~ ., data = ptitanic)
prp(tree, extra = 1, faclen=0, nn = T,
box.col=c("green", "red")) #apparently random colouring?
prp(tree, extra = 1, faclen=0, nn = T,
box.col=c("green", "red")[tree$frame$yval]) #no colour
It turns out that specifying conditional box.col values isn't that different from conditionally colouring other graphs. I found this post useful in coming up with a solution: Using Conditional Statements to Change the Color of Data Points
The key is that tree$frame gives a data frame that can be used to build conditional statements (see the rpart documentation). The yval variable holds the predicted outcome of interest (in this case age) and can be used to dictate the colouring.
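To see what the conditions are evaluated against, it can help to look at the frame directly. The sketch below refits the tree from the question and collects a few pieces of tree$frame into a helper data frame (node_info is just an illustrative name). As far as I can tell, box.col is matched against these rows in order and recycled when it is shorter, which would explain the seemingly random pattern in the first attempt. In the second attempt, indexing a two-element vector by ages such as 29.9 returns NA, hence no colour.

```r
library(rpart)
library(rpart.plot)
data(ptitanic)
tree <- rpart(age ~ ., data = ptitanic)

# One row per node (internal nodes and leaves alike).
# node_info is a hypothetical helper, not part of rpart.
node_info <- data.frame(
  node    = as.numeric(row.names(tree$frame)),  # row names hold the node numbers
  is_leaf = tree$frame$var == "<leaf>",         # terminal nodes are marked "<leaf>"
  yval    = tree$frame$yval                     # predicted age in each node
)
print(node_info)

# Indexing by a non-integer far beyond the vector length gives NA
print(c("green", "red")[29.9])
```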
Here are solutions to colour with 2 colours and 3 colours:
# 2 colours
# use ifelse: if predicted age > 30 colour red, else colour green
prp(tree, extra = 1, faclen = 0, nn = TRUE,
    box.col = ifelse(tree$frame$yval > 30, "red", "green"))
# 3 colours
# use findInterval: predicted age in [0,20) -> green; [20,30) -> orange; 30 or more -> red
prp(tree, extra = 1, faclen = 0, nn = TRUE,
    box.col = c("green", "orange", "red")[findInterval(tree$frame$yval, vec = c(0, 20, 30))])
Node number isn't stored as a column of tree$frame, but it is available as the row names (as.numeric(row.names(tree$frame))), so boxes can be coloured by node number in the same way. For my purposes the above solution will work.
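For completeness, here is a minimal sketch of colouring by node number and restricting the colours to terminal nodes. It assumes the row names of tree$frame are the node numbers displayed by nn = TRUE (as the rpart.object documentation describes) and that leaves are the rows where var is "<leaf>". The cut points in breaks are arbitrary values chosen for illustration, not anything from my real data.

```r
library(rpart)
library(rpart.plot)
data(ptitanic)
tree <- rpart(age ~ ., data = ptitanic)

node    <- as.numeric(row.names(tree$frame))  # node numbers, one per row of the frame
is_leaf <- tree$frame$var == "<leaf>"         # TRUE for terminal nodes

# Hypothetical groupings: nodes 1-4 green, 5-9 orange, 10 and above red
breaks <- c(0, 5, 10)
cols <- c("green", "orange", "red")[findInterval(node, vec = breaks)]

# Leave internal nodes uncoloured so only terminal nodes stand out
cols[!is_leaf] <- "white"

prp(tree, extra = 1, faclen = 0, nn = TRUE, box.col = cols)
```

The same is_leaf mask can be combined with the yval-based colours above if only the terminal nodes should be coloured by predicted age.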