Search code examples
pythonpandasdataframedistribution

How to determine which row in dataframe has most even and highest distribution


I would like to sort a pandas dataframe by the rows which have the most even distribution but also high values. For example:

Row  Attribute1  Attribute2 Attribute3
a        1           1          108
b        10          2          145
c        50          60         55
d        100         90         120
e        20          25         23
f        1000        30         0

Rows d and c should rank the highest, ideally d followed by c.

I considered using standard deviation to identify the most even distribution and then mean to get the highest average values but I'm unsure as to how I can combine these together.


Solution

  • As the perception of "even distribution" you mention seems to be quite subjective, here is an instuction to implement the coefficient of variation mentionned by @ALollz.

    df.std(axis=1) / df.mean(axis=1)
    
    Row 0
    a   1.6848130582715446
    b   1.535375387727906
    c   0.09090909090909091
    d   0.14782502241793033
    e   0.11102697698927574
    f   1.6569547684031352
    

    This metrics is the percentage of the mean represented by the standard deviation. If you have a row mean of 10 and a standard deviation of 1, the ratio will be 10% or 0.1

    In this example, the row that could be considered most 'evenly distributed' is the row c: its mean is 55 and its standard deviation is 5. Therefore the ratio is about 9%.

    This way, you can have a decent overview of the homogeneity of the distribution.

    If you want the ranking, you can apply .sort_values:

    (df.std(axis=1) / df.mean(axis=1)).sort_values()
    Row 0
    c   0.09090909090909091
    e   0.11102697698927574
    d   0.14782502241793033
    b   1.535375387727906
    f   1.6569547684031352
    a   1.6848130582715446
    

    My last words would be to not be fooled by our brain's perception: it can be easily tricked by statistics.

    Now if you want to improve results of higher values, you can divide this coefficient by the mean: the higher the mean, the lower the coefficient.

    (df.std(axis=1) / df.mean(axis=1)**2).sort_values()
    
    Row 0
    d   0.0014305647330767452
    c   0.001652892561983471
    f   0.004826081849717869
    e   0.004898248984820989
    b   0.029338383204991835
    a   0.045949447043769395
    
    

    And now we obtain the desired ranking : d first, then c, f, e, b and a