I wrote a short R script to check how the split criteria work. I got unexpected results: all of them choose the same value to split on. Can someone explain this? Here is the code:
set.seed(1)
y <- sample(c(1, 0), 10000, replace = TRUE)
x <- seq(1, 10000)
data <- data.frame(x, y)
library(rpart)
rpart(y~x,data = data,parms=list(split="gini"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
rpart(y~x,data = data,parms=list(split="information"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
rpart(y~x,data = data,method = "anova",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
In my case only the last rpart command produced a split:
> set.seed(1)
> y <- sample(c(1, 0), 1000, replace = T)
> x <- seq(1, 1000)
> data <- data.frame(x, y)
> library(rpart)
No split with split="gini":
> rpart(y~x,data = data,parms=list(split="gini"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 1000 480 1 (0.4800000 0.5200000) *
No split with split="information":
> rpart(y~x,data = data,parms=list(split="information"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 1000 480 1 (0.4800000 0.5200000) *
There is a single split with method="anova":
> rpart(y~x,data = data,method = "anova",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000
node), split, n, deviance, yval
* denotes terminal node
1) root 1000 249.6000 0.5200000
2) x< 841.5 841 210.1831 0.5089180 *
3) x>=841.5 159 38.7673 0.5786164 *
As for why the split points can be in the same position, here are a couple of extracts from the rpart documentation:
So it seems that, for a two-class problem, the different measures may produce similar split points.
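To see this without going through rpart, here is a minimal base-R sketch that scores every candidate split of the form x < cut under each criterion. The helpers gini, entropy and split_gain are my own illustrative names, not rpart functions, and the sketch assumes rpart's split search behaves like this textbook impurity-decrease calculation. It ignores cp, minsplit and the other controls.

```r
# Node impurity for a two-class problem, as a function of the
# proportion p of observations with y == 1 (illustrative helpers).
gini    <- function(p) 2 * p * (1 - p)
entropy <- function(p) {
  ifelse(p == 0 | p == 1, 0, -p * log(p) - (1 - p) * log(1 - p))
}

# Impurity decrease for each split that sends the first k observations
# (k = 1, ..., n - 1) to the left child. Assumes y is already ordered by x.
split_gain <- function(y, impurity) {
  n          <- length(y)
  n_left     <- seq_len(n - 1)
  ones_left  <- cumsum(y)[n_left]
  n_right    <- n - n_left
  ones_right <- sum(y) - ones_left
  impurity(mean(y)) -
    (n_left / n)  * impurity(ones_left / n_left) -
    (n_right / n) * impurity(ones_right / n_right)
}

set.seed(1)
y <- sample(c(1, 0), 1000, replace = TRUE)  # x is just 1:1000, so y is ordered by x

gain_gini  <- split_gain(y, gini)
gain_info  <- split_gain(y, entropy)
# For a 0/1 response the variance of a node is p * (1 - p),
# so the anova criterion can be scored with the same helper.
gain_anova <- split_gain(y, function(p) p * (1 - p))

# Position of the best split under each criterion
c(gini  = which.max(gain_gini),
  info  = which.max(gain_info),
  anova = which.max(gain_anova))
```

For a 0/1 response the anova impurity p(1-p) is exactly half the Gini index, so those two criteria rank the candidate splits identically. Entropy is a different concave function of p, so it usually picks the same or a nearby split, but it does not have to.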