Differences between Gini, information gain, and sum of squared errors in rpart (R)


I wrote a short R script to check how the split criteria work. I got unexpected results: all of them choose the same value to split on. Can someone explain why? Here is the code:

set.seed(1)
y <- sample(c(1, 0), 10000, replace = TRUE)  # random 0/1 response, unrelated to x
x <- seq(1, 10000)
data <- data.frame(x, y)

library(rpart)
# Classification tree, Gini index
rpart(y ~ x, data = data, parms = list(split = "gini"), method = "class", control = list(maxdepth = 1, cp = 0.0001, minsplit = 1))
# Classification tree, information gain
rpart(y ~ x, data = data, parms = list(split = "information"), method = "class", control = list(maxdepth = 1, cp = 0.0001, minsplit = 1))
# Regression tree, sum of squared errors
rpart(y ~ x, data = data, method = "anova", control = list(maxdepth = 1, cp = 0.0001, minsplit = 1))

Solution

  • In my case, only the last rpart call produced a split (note that I used 1000 observations rather than 10000):

    > set.seed(1)
    > y <- sample(c(1, 0), 1000, replace = T)
    > x <- seq(1, 1000)
    > data <- data.frame(x, y)
    > library(rpart)
    

    No split with split="gini":

    > rpart(y~x,data = data,parms=list(split="gini"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
    n= 1000 
    
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
    
    1) root 1000 480 1 (0.4800000 0.5200000) *
    

    No split with split="information":

    > rpart(y~x,data = data,parms=list(split="information"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
    n= 1000 
    
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
    
    1) root 1000 480 1 (0.4800000 0.5200000) *
    

    There is a single split with method = "anova":

    > rpart(y~x,data = data,method = "anova",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
    n= 1000 
    
    node), split, n, deviance, yval
          * denotes terminal node
    
    1) root 1000 249.6000 0.5200000  
      2) x< 841.5 841 210.1831 0.5089180 *
      3) x>=841.5 159  38.7673 0.5786164 *
    

    As for why the split points can end up in the same position, here are a couple of extracts from the rpart documentation:

    • Gini measure vs. Information impurity (page 6): "For the two class problem the measures differ only slightly, and will nearly always choose the same split point."
    • Gini measure vs. [ANalysis Of] Variances (page 41): "... for the two class case the Gini splitting rule reduces to 2p(1 − p), which is the variance of a node."
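
    The second quote can be spelled out with a short derivation. This is a sketch under simplifying assumptions (a 0/1 response, equal priors, no loss matrix); the symbols p, n, n_L, n_R are introduced here for illustration and rpart's internal scaling may differ. Let p be the proportion of ones in a node with n observations:

    ```latex
    \begin{align*}
    G(p) &= 1 - p^2 - (1-p)^2 = 2p(1-p), \\
    \operatorname{Var}(y) &= p(1-p) = \tfrac{1}{2}\,G(p), \\
    \mathrm{SSE} &= \sum_{i=1}^{n} (y_i - p)^2 = n\,p(1-p) = \tfrac{n}{2}\,G(p), \\
    \Delta\mathrm{SSE} &= \mathrm{SSE} - \mathrm{SSE}_L - \mathrm{SSE}_R
      = \tfrac{n}{2}\Bigl[\,G(p) - \tfrac{n_L}{n}\,G(p_L) - \tfrac{n_R}{n}\,G(p_R)\Bigr]
      = \tfrac{n}{2}\,\Delta G .
    \end{align*}
    ```

    So the Gini impurity is twice the node variance, and the decrease in SSE is a constant multiple of the decrease in weighted Gini impurity, which means both are maximised at the same split point. As a check against the anova output above, the root deviance is 1000 × 0.52 × 0.48 = 249.6.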

    So it seems that, for a two-class problem, the different measures can be expected to produce the same or very similar split points.
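
To see this without relying on rpart at all, the three criteria can be scored by hand for every candidate threshold. The following is a minimal base-R sketch, not rpart's actual implementation: the helper names (`gini`, `entropy`, `gain`, `sse`) are made up for illustration, and it assumes equal priors, no loss matrix, and no `minbucket` restriction.

```r
# Score every candidate threshold on x under three split criteria (base R only).
set.seed(1)
n <- 1000
y <- sample(c(1, 0), n, replace = TRUE)  # x = 1..n, so sorting by x keeps this order

# Two-class impurity as a function of the proportion p of ones in a node
gini    <- function(p) 2 * p * (1 - p)
entropy <- function(p) ifelse(p == 0 | p == 1, 0,
                              -p * log(p) - (1 - p) * log(1 - p))

# Candidate split k sends observations with x <= k to the left child
k       <- 1:(n - 1)
n_left  <- k
n_right <- n - k
ones    <- cumsum(y)
p_left  <- ones[k] / n_left
p_right <- (sum(y) - ones[k]) / n_right
p_root  <- mean(y)

# Decrease in weighted impurity for an impurity function f
gain <- function(f) f(p_root) - (n_left * f(p_left) + n_right * f(p_right)) / n
gini_gain <- gain(gini)
info_gain <- gain(entropy)

# Decrease in sum of squared errors; for a 0/1 response the SSE of a node
# with m observations and mean p is m * p * (1 - p)
sse      <- function(m, p) m * p * (1 - p)
sse_gain <- sse(n, p_root) - sse(n_left, p_left) - sse(n_right, p_right)

# Index of the best split under each criterion
best <- c(gini        = which.max(gini_gain),
          information = which.max(info_gain),
          anova       = which.max(sse_gain))
print(best)

# The Gini gain is the SSE gain rescaled by 2 / n, up to rounding error
print(all.equal(gini_gain, 2 * sse_gain / n))
```

Because `gini_gain` is just `sse_gain` multiplied by a constant, those two criteria rank the candidate splits identically; the information gain follows a different curve but usually peaks at or near the same place. The exact indices printed depend on the R version, since the default behaviour of `sample()` changed in R 3.6.0.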