Tags: r, machine-learning, data-mining, decision-tree, rpart

The prp() function from rpart.plot in R only plots a single leaf node. Why?


I am learning how to code in R for machine learning. I am using rpart to do the heavy lifting. However, when I go to plot my decision tree, only a leaf node 'yes' is plotted. I've created the decision tree myself by hand using information gain. The tree should have three levels of nodes.

[Image: decision tree by hand]

Here is what R gives me.

[Image: decision tree plot from prp()]

Here is my R code.

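# Load packages (only rpart, rpart.plot, and dplyr are used below)
library(FSelector)
library(rpart)
library(rpart.plot)
library(caret)
library(dplyr)
library(data.tree)
library(caTools)
# Read the data and keep the four predictors plus the target column
table <- read.csv("play-data.csv")
table <- select(table, Outlook, Temperature, Humidity, Windy, Play)
# Convert the categorical columns to factors
table <- mutate(table,
                Outlook = factor(Outlook),
                Temperature = factor(Temperature),
                Humidity = factor(Humidity),
                Play = factor(Play))
# Fit a tree with rpart's default settings and plot it
tree <- rpart(Play ~ Outlook + Temperature + Humidity + Windy, data = table)
prp(tree)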

Here is the data from 'play-data.csv'.

[Image: contents of play-data.csv]

The data is being read in correctly, and the selection and mutation functions seem to be fine as well. So I don't know what gives. I tried Googling the problem but only found one other thread about it with no concise answer that I can understand.


Solution

  • You are getting a tree with a single node because you are using rpart's default settings. The documentation is a little indirect: the entry for rpart lists a parameter called control and says "See rpart.control." If you click through to the documentation for rpart.control, you will see a parameter called minsplit, described as "the minimum number of observations that must exist in a node in order for a split to be attempted." The default value is 20, and you have only 14 data points altogether, so rpart never attempts to split the root node. Instead, use rpart.control to set minsplit to a lower value (try 2) and pass the result to rpart through the control argument.
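As a rough illustration of the point above, here is a minimal sketch that fits the same formula twice, once with the defaults and once with minsplit lowered to 2. The data frame play is an assumption: it is the classic 14-row weather data set typed in by hand as a stand-in, since the contents of play-data.csv are not reproduced here, so your own file may differ. The names play, fit_default, and fit_small are only illustrative.

```r
library(rpart)

# Stand-in data (assumption): the classic 14-row "play" weather data set,
# used here in place of play-data.csv, whose exact contents may differ.
play <- data.frame(
  Outlook     = c("sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
                  "overcast", "sunny", "sunny", "rainy", "sunny",
                  "overcast", "overcast", "rainy"),
  Temperature = c("hot", "hot", "hot", "mild", "cool", "cool", "cool",
                  "mild", "cool", "mild", "mild", "mild", "hot", "mild"),
  Humidity    = c("high", "high", "high", "high", "normal", "normal",
                  "normal", "high", "normal", "normal", "normal", "high",
                  "normal", "high"),
  Windy       = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "TRUE", "TRUE",
                  "FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE"),
  Play        = c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes",
                  "yes", "yes", "yes", "yes", "no"),
  stringsAsFactors = TRUE  # turn every text column into a factor
)

# The default minimum node size for attempting a split
rpart.control()$minsplit  # 20

# Default settings: 14 rows < minsplit, so the root is never split
fit_default <- rpart(Play ~ Outlook + Temperature + Humidity + Windy,
                     data = play, method = "class")

# Lower minsplit so that small nodes are eligible for splitting
fit_small <- rpart(Play ~ Outlook + Temperature + Humidity + Windy,
                   data = play, method = "class",
                   control = rpart.control(minsplit = 2))

# Each row of $frame is one node of the fitted tree
nrow(fit_default$frame)  # 1 (root only)
nrow(fit_small$frame)    # more than 1 once splits are made

# Plot the larger tree if rpart.plot is installed
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::prp(fit_small)
}
```

With so few rows, minsplit = 2 lets the tree grow until the nodes are nearly pure, so the result fits the training data closely. That is fine for comparing against a hand-built tree, but on a larger data set you would likely want a less aggressive value.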