Tags: python, machine-learning, cluster-analysis, gini

How to calculate Gini coefficients for clustering


I have 5000 observations that are clustered into 5 clusters. In reality, each cluster should contain exactly 1000 observations. However, after running my clustering algorithm, the result looks like this:

Cluster #, true members, clustered members
0,                 1000,               435
1,                 1000,               234
2,                 1000,               167
3,                 1000,               654
4,                 1000,                 0

In other words, cluster 0 should have 1000 members, but only 435 of those were correctly assigned to it by my algorithm. The remaining observations (the difference between 5000 and the correctly clustered ones) were placed in the wrong clusters.
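For reference, here is the same table written as Python data; the names clusters, total_true and total_correct are just what I call them here:

# One (true_members, correctly_clustered) pair per cluster, as in the table above.
clusters = [
    (1000, 435),
    (1000, 234),
    (1000, 167),
    (1000, 654),
    (1000, 0),
]

total_true = sum(n_true for n_true, _ in clusters)          # 5000
total_correct = sum(n_corr for _, n_corr in clusters)       # 1490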

I would like to calculate the Gini coefficient, and have found the following code:

def gini_ind(number, total):
    # Binary Gini impurity: 1 - p**2 - (1 - p)**2, where p = number / total
    # is the fraction of members that were correctly assigned to the cluster.
    return 1 - ((number / total) ** 2 + ((total - number) / total) ** 2)

It seems to work fine on the tests I have tried. However, none of the datasets I have found look like mine.

So my question is how do I calculate the Gini coefficient?

If I do the following, I get these Gini coefficients for each cluster:

gini_ind(435,1000) -> 0.49155
gini_ind(234,1000) -> 0.3584
gini_ind(167,1000) -> 0.2782
gini_ind(654,1000) -> 0.4525
gini_ind(0,1000) -> 0
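The same numbers come out of a small loop; this is just a sketch that restates the function so it runs on its own (the list name correctly_clustered is mine):

def gini_ind(number, total):
    # Binary Gini impurity: 1 - p**2 - (1 - p)**2, with p = number / total.
    return 1 - ((number / total) ** 2 + ((total - number) / total) ** 2)

correctly_clustered = [435, 234, 167, 654, 0]
for i, n in enumerate(correctly_clustered):
    print(f"cluster {i}: {gini_ind(n, 1000):.5f}")
# Should reproduce the values listed above (up to rounding).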

Is that the correct Gini coefficient for each of the clusters?

And to get the average Gini coefficient, is that just the simple average: (0.49155 + 0.3584 + 0.2782 + 0.4525 + 0) / 5?
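To make the question concrete, this is what I mean, with a size-weighted variant alongside it (the weighting would only matter if the true cluster sizes differed; with 1000 in every cluster both give the same number, and all names are just illustrative):

ginis = [0.49155, 0.3584, 0.2782, 0.4525, 0.0]
true_sizes = [1000, 1000, 1000, 1000, 1000]

# Simple (unweighted) mean over the five clusters.
simple_avg = sum(ginis) / len(ginis)

# Mean weighted by true cluster size.
weighted_avg = sum(g * s for g, s in zip(ginis, true_sizes)) / sum(true_sizes)

print(simple_avg, weighted_avg)   # both about 0.3161 here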


Solution

  • Let's assume we have 3 classes and 80 objects: 19 objects in class 1, 21 in class 2, and 40 in class 3 (denoted as (19, 21, 40)).

    The Gini index would be: 1 - [(19/80)^2 + (21/80)^2 + (40/80)^2] = 0.6247, i.e. cost_before = Gini(19, 21, 40) = 0.6247

    In order to decide where to split, we test all possible splits. For example, consider splitting at 2.0623, which results in the partitions (16, 9, 0) and (3, 12, 40):

    After testing x1 < 2.0623:

    costL = Gini(16, 9, 0) = 0.4608
    costR = Gini(3, 12, 40) = 0.4205
    

    Then we weight branch impurity by empirical branch probabilities:

    cost_{x1 < 2.0623} = (25/80) * costL + (55/80) * costR = 0.4331
    

    We do that for every possible split, for example x1 < 1:

    cost_{x1 < 1} = fraction_L * Gini(8, 4, 0) + fraction_R * Gini(11, 17, 40) = (12/80) * 0.4444 + (68/80) * 0.5653 = 0.5472
    

    After that, we choose the split with the lowest cost. This is the split x1 < 2.0623 with a cost of 0.4331 (a code sketch reproducing these numbers follows the links below).

    For more detail, you may follow these links: http://dni-institute.in/blogs/gini-index-work-out-example/ and http://stats.stackexchange.com/questions/95839/gini-decrease-and-gini-impurity-of-children-nodes
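Below is a minimal sketch that reproduces the numbers from the example; the gini function and the variable names are illustrative rather than taken from any library:

def gini(counts):
    # Multi-class Gini impurity: 1 - sum over classes k of (n_k / n)^2.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

cost_before = gini([19, 21, 40])            # about 0.6247

# Candidate split x1 < 2.0623: left branch (16, 9, 0), right branch (3, 12, 40).
left, right = [16, 9, 0], [3, 12, 40]
n_left, n_right = sum(left), sum(right)     # 25 and 55 of the 80 objects

# Weight each branch's impurity by its empirical branch probability.
cost_split = (n_left * gini(left) + n_right * gini(right)) / (n_left + n_right)
print(cost_before, cost_split)              # about 0.6247 and 0.4331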