Quick question on R tree models. I want to fit a tree model on a lot of variables (mostly numeric or factor variables). One of the variables is Gender, where the categories are male, female and unknown. When I use the tree or rpart function from the tree and rpart libraries, I only get two branches from the Gender root. The unknown gender has been grouped with the females to form a single category, so the branches I am getting are Female+Unknown and Male. I checked the tree package PDF http://cran.r-project.org/web/packages/tree/tree.pdf and it says that the levels of an unordered factor are divided into two non-empty groups. The rpart function appears to be very similar to the tree function in terms of handling factors with more than 2 levels.
My question is therefore: are there any other functions or packages in R that will let me produce more than two branches from a single node, or does anyone have any suggestions on other open source tools that will do the same? Let me know if you need any more information.
rpart() is perfectly capable of handling a response with more than 2 categories. Try:
library(rpart)
mod <- rpart(Species ~ ., data = iris)
mod
plot(mod)
text(mod)
Which produces a tree with 3 terminal nodes when run using the default settings:
R> mod
n= 150
node), split, n, loss, yval, (yprob)
      * denotes terminal node
1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
  2) Petal.Length< 2.45 50 0 setosa (1.00000000 0.00000000 0.00000000) *
  3) Petal.Length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
    6) Petal.Width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259) *
    7) Petal.Width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *
The recursive partitioning algorithm stops building a tree when certain stopping rules are met. There is no point splitting a node that is already pure (of a single class). By default, a node must have at least 20 observations before a split is attempted (minsplit), no split is made that would leave a terminal node with fewer than 7 observations (minbucket), a split is not attempted unless it decreases the overall lack of fit by a factor of at least 0.01 (cp), and so on. Some of these can be controlled via the rpart.control() function.
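To see the defaults mentioned above for yourself, you can inspect the list that rpart.control() returns when called with no arguments. This is only a small sketch; the variable name defaults is mine and not part of rpart.

```r
library(rpart)

# rpart.control() with no arguments returns the default settings as a list
defaults <- rpart.control()

defaults$minsplit   # minimum observations in a node before a split is attempted
defaults$minbucket  # minimum observations allowed in any terminal node
defaults$cp         # a split must improve the overall fit by at least this factor
```

If you set only one of minsplit or minbucket, rpart derives the other from it, so it is worth printing the whole list after changing anything.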
From what limited information you have given us, I can only conclude that these defaults are inappropriate for your data set and you should adjust them accordingly, e.g.:
ctrl <- rpart.control(minsplit = 2, minbucket = 1, cp = 0.00001)
mod2 <- rpart(Species ~ ., data = iris, control = ctrl)
mod2
plot(mod2)
text(mod2)
Which for this example data set produces a much larger tree:
R> mod2
n= 150
node), split, n, loss, yval, (yprob)
      * denotes terminal node
1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
  2) Petal.Length< 2.45 50 0 setosa (1.00000000 0.00000000 0.00000000) *
  3) Petal.Length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
    6) Petal.Width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259)
      12) Petal.Length< 4.95 48 1 versicolor (0.00000000 0.97916667 0.02083333)
        24) Petal.Width< 1.65 47 0 versicolor (0.00000000 1.00000000 0.00000000) *
        25) Petal.Width>=1.65 1 0 virginica (0.00000000 0.00000000 1.00000000) *
      13) Petal.Length>=4.95 6 2 virginica (0.00000000 0.33333333 0.66666667)
        26) Petal.Width>=1.55 3 1 versicolor (0.00000000 0.66666667 0.33333333)
          52) Sepal.Length< 6.95 2 0 versicolor (0.00000000 1.00000000 0.00000000) *
          53) Sepal.Length>=6.95 1 0 virginica (0.00000000 0.00000000 1.00000000) *
        27) Petal.Width< 1.55 3 0 virginica (0.00000000 0.00000000 1.00000000) *
    7) Petal.Width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087)
      14) Petal.Length< 4.85 3 1 virginica (0.00000000 0.33333333 0.66666667)
        28) Sepal.Length< 5.95 1 0 versicolor (0.00000000 1.00000000 0.00000000) *
        29) Sepal.Length>=5.95 2 0 virginica (0.00000000 0.00000000 1.00000000) *
      15) Petal.Length>=4.85 43 0 virginica (0.00000000 0.00000000 1.00000000) *
but is most likely highly over-fitted to the data.
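One common way to check for over-fitting, sketched here as a suggestion rather than a recipe, is to look at the cross-validated error that rpart stores in the fitted object's cptable and prune the tree back to the complexity value with the lowest error. The names big, best_cp, pruned and n_leaves are mine, and the seed is arbitrary (the cross-validation folds are random, so results can vary between runs).

```r
library(rpart)

set.seed(1)  # arbitrary seed; cross-validation folds are drawn at random

# Re-fit the deliberately large tree so this block is self-contained
ctrl <- rpart.control(minsplit = 2, minbucket = 1, cp = 0.00001)
big  <- rpart(Species ~ ., data = iris, control = ctrl)

# cptable has one row per candidate subtree; "xerror" is the cross-validated error
cp_table <- big$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the subtree associated with that complexity value
pruned <- prune(big, cp = best_cp)

# Hypothetical helper: count terminal nodes in a fitted rpart object
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
c(full = n_leaves(big), pruned = n_leaves(pruned))
```

printcp(big) and plotcp(big) show the same table in a more readable form if you want to choose the cut-off by eye.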
That said, there are, of course, other packages that can fit trees to data sets and that, like rpart(), can handle a response with more than two levels. The main ones are listed in the Machine Learning & Statistical Learning Task View on CRAN, which you should consult. One such package is party.
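As a rough sketch of what that looks like with party, assuming the package is installed (install.packages("party")), the same iris model can be fitted with ctree(). The guard simply skips the example if party is not available, and the name ct is mine.

```r
# Assumption: party is installed; otherwise the block below is skipped
if (requireNamespace("party", quietly = TRUE)) {
  library(party)

  # Conditional inference tree for the three-class response
  ct <- ctree(Species ~ ., data = iris)
  print(ct)
  plot(ct)

  # Compare fitted classes with the observed species
  print(table(predicted = predict(ct), observed = iris$Species))
}
```

Note that ctree(), like rpart(), builds the tree from binary splits, so a three-class response shows up as several terminal nodes rather than as one three-way split.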