
Why is Mean Decrease Gini in Random Forest dependent on population size?


I am using the R package randomForest, and to understand variable importance we can inspect varImpPlot, which shows Mean Decrease Gini. I have studied Random Forest in detail and am well aware of how the model works, but there is something I cannot completely understand about how Mean Decrease Gini is calculated, or rather why it depends on the size of the population.

[image: Gini index formula]

Once we have calculated the Gini index, we can aggregate the Mean Decrease Gini with the following formula (divided by the number of trees):

[image: Mean Decrease Gini aggregation formula]
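
To make the formula concrete, here is a small sketch of the per-split calculation I have in mind (these are just illustrative helper functions, not anything from the randomForest package):

# Gini impurity of a node, from a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Proportion-weighted decrease in Gini impurity for one split of y into y_left / y_right
gini_decrease <- function(y, y_left, y_right) {
  n <- length(y); n_l <- length(y_left); n_r <- length(y_right)
  gini(y) - (n_l / n) * gini(y_left) - (n_r / n) * gini(y_right)
}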

I understand that there will be more splits in each tree with a larger population, but shouldn't those splits, on average, have very small decreases in the Gini index?

Here is example code showing what I mean (as expected, the number of trees does not affect Mean Decrease Gini, but the population has a huge effect and seems to be more or less linear with population size):

install.packages("randomForest")
library(randomForest)

set.seed(1)
a <- as.factor(c(rep(1, 20), rep(0, 30)))
b <- c(rnorm(20, 5, 2), rnorm(30, 4, 1))
c <- c(rnorm(25, 0, 1), rnorm(25, 1, 2))
data <- data.frame(a = a, b = b, c = c)

rf <- randomForest(data = data, a ~ b + c, importance = T, ntree = 300)
varImpPlot(rf)


a2 <- as.factor(c(rep(1, 200), rep(0, 300)))
b2 <- c(rnorm(200, 5, 2), rnorm(300, 4, 1))
c2 <- c(rnorm(250, 0, 1), rnorm(250, 1, 2))
data2 <- data.frame(a2 = a2, b2 = b2, c2 = c2)

rf2 <- randomForest(data = data2, a2 ~ b2 + c2, importance = T, ntree = 300)
varImpPlot(rf2)


a3 <- as.factor(c(rep(1, 2000), rep(0, 3000)))
b3 <- c(rnorm(2000, 5, 2), rnorm(3000, 4, 1))
c3 <- c(rnorm(2500, 0, 1), rnorm(2500, 1, 2))
data3 <- data.frame(a3 = a3, b3 = b3, c3 = c3)

rf3 <- randomForest(data = data3, a3 ~ b3 + c3, importance = T, ntree = 300)
varImpPlot(rf3)

This results in the following plots, where we see that the x-axis increases approximately 10x for each increase in population:

[images: varImpPlot output for rf, rf2, and rf3]

My guess is that each split is weighted by the number of observations it involves. That is, a split made in one of the first nodes, splitting 1000 observations, weighs more than a split conducted further down the tree with, say, 10 observations. I can't find this in any literature, though, since all the calculations seem to be made using fractions of the population rather than absolute numbers.

What is it that I am missing?


Solution

  • Your guess is correct.

    You have written down the definition of Gini impurity for a single split. Trees in a random forest are usually split multiple times. The higher nodes have more samples and, intuitively, are more "impure". So the formula for the mean decrease in Gini takes the node sizes into account.

    So instead of

    Delta i(tau) = i(tau) - (n_l/n) i(tau_l) - (n_r/n) i(tau_r)
    

    the decrease in impurity is calculated as

    Delta i(tau) = n i(tau) - n_l i(tau_l) - n_r i(tau_r)
    

    That is, weight the impurities by the raw counts, not the proportions.
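
    Here is a quick sketch of the difference (ad-hoc helper code, not anything exported by randomForest): for the same perfectly separating split of a 50/50 two-class node, the proportion-weighted decrease stays at 0.5 regardless of sample size, while the count-weighted decrease grows linearly with it.

    # Gini impurity of a node, from a vector of class labels
    gini <- function(y) { p <- table(y) / length(y); 1 - sum(p^2) }

    # Decrease in impurity for a node with n observations of each class,
    # split perfectly into a pure left child and a pure right child
    decrease <- function(n) {
      y  <- rep(c(1, 0), times = c(n, n))  # parent node: 50/50, impurity 1/2
      yl <- rep(1, n)                      # left child: pure
      yr <- rep(0, n)                      # right child: pure
      c(proportion_weighted = gini(y) - (n / (2*n)) * gini(yl) - (n / (2*n)) * gini(yr),
        count_weighted      = 2*n * gini(y) - n * gini(yl) - n * gini(yr))
    }

    decrease(50)   # proportion_weighted 0.5, count_weighted 50
    decrease(500)  # proportion_weighted 0.5, count_weighted 500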

    The algorithm keeps splitting the tree to the maximum possible size (unless you specify the nodesize or maxnodes arguments), so a feature can be chosen as the split variable several times. Its overall importance is the sum of the Deltas at those splits. That is the importance computation for one tree; finally, the importances are averaged across all the trees in the forest.

    Let's show this with a very contrived example.

    library("randomForest")
    #> randomForest 4.6-14
    #> Type rfNews() to see new features/changes/bug fixes.
    set.seed(1)
    
    n <- 1000
    # There are three classes in equal proportions
    a <- rep(c(-10,0,10), each = n)
    # One feature is useless
    b <- rnorm(3*n)
    # The other feature is highly predictive but we need at least two splits
    c <- rnorm(3*n, a)
    data <- data.frame(a = as.factor(a), b = b, c = c)
    
    # First let's do just one split, i.e., ask for just two terminal nodes
    
    # Expected MeanDecreaseGini:
    # With one split the best we can do is separate one class from the other two
    3000*(2/3) - 1000*0 - 2000*(1/2)
    #> [1] 1000
    
    # Actual MeanDecreaseGini
    rf3 <- randomForest(data = data, a ~ b + c, importance = TRUE,
                        ntree = 1000, mtry = 2, maxnodes = 2)
    rf3$importance[, "MeanDecreaseGini"]
    #>        b        c 
    #>    0.000 1008.754
    
    
    # Next let's do two splits; this is enough to separate classes perfectly
    
    # Expected MeanDecreaseGini:
    3000*(2/3) - 1000*0 - 2000*(1/2)  +   2000*(1/2) - 1000*0 - 1000*0
    #> [1] 2000
    
    # Actual MeanDecreaseGini
    rf3 <- randomForest(data = data, a ~ b + c, importance = TRUE,
                        ntree = 1000, mtry = 2, maxnodes = 3)
    rf3$importance[, "MeanDecreaseGini"]
    #>        b        c 
    #>    0.000 1999.333
    

    Created on 2019-03-08 by the reprex package (v0.2.1)
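
    To connect this back to the original question: because the decreases are weighted by raw counts, doubling the sample size in the same contrived example should roughly double the reported MeanDecreaseGini. Here is a quick check along the same lines (a sketch continuing the session above; data_big and rf_big are just illustrative names):

    # Same setup as above, but with twice as many observations per class
    # (randomForest is already loaded)
    n <- 2000
    a <- rep(c(-10, 0, 10), each = n)
    b <- rnorm(3 * n)
    c <- rnorm(3 * n, a)
    data_big <- data.frame(a = as.factor(a), b = b, c = c)

    # Expected MeanDecreaseGini with two splits, by the same arithmetic as before:
    6000*(2/3) - 2000*0 - 4000*(1/2)  +  4000*(1/2) - 2000*0 - 2000*0
    #> [1] 4000

    rf_big <- randomForest(data = data_big, a ~ b + c, importance = TRUE,
                           ntree = 1000, mtry = 2, maxnodes = 3)
    rf_big$importance[, "MeanDecreaseGini"]  # c should come out close to 4000, i.e., about double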

    PS: It is all well and good to know how importances are computed with the Gini criterion, but read this article for an explanation of why you should use permutation importances instead: https://explained.ai/rf-importance/index.html
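
    For a quick look at a permutation-based measure without leaving randomForest: since the forest above was fit with importance = TRUE, the mean decrease in accuracy (computed by permuting each variable in the out-of-bag samples) can be extracted from the same fit, for example:

    # Permutation importance (mean decrease in accuracy) from the rf3 fit above;
    # scale = FALSE returns the raw, unscaled values
    importance(rf3, type = 1, scale = FALSE)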