
Calculate the average of the lowest n percentile


I have the following dataset and want to find the average of the runs that fall in the lowest 20 percent. For example, if I sort the runs column and divide it into 5 equal batches, the first two rows fall in the lowest 20 percent, so their average is (1+2)/2 = 1.5. How do I divide the data frame into 5 batches (after sorting) and then find the average of a specific batch?

I have tried the following, but the output shows 2.8 instead of 3:

d.runs.quantile(0.2)

Input:


import pandas as pd

ODI_runs = {'name': ['Tendulkar', 'Sangakkara', 'Ponting',
                     'Jayasurya', 'Jayawardene', 'Kohli',
                     'Haq', 'Kallis', 'Ganguly', 'Dravid'],
            'runs': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
d = pd.DataFrame(ODI_runs)

name            runs
Tendulkar       1
Sangakkara      2
Ponting         3
Jayasurya       4
Jayawardene     5
Kohli           6
Haq             7
Kallis          8
Ganguly         9
Dravid          10

Output:

1.5

Solution

  • You could use the pandas.Series.quantile method: df["runs"].quantile(0.2) returns the value that separates the lowest 20% of the data (2.8 here, where df is the question's d). The rest is plain pandas: use loc to select the rows whose runs are at or below that value, and calculate the .mean() of those values:

    >>> df.loc[df["runs"] <= df["runs"].quantile(0.2), "runs"].mean()
    1.5
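
The question also asks how to split the data into 5 sorted batches and average one of them. A minimal sketch of that approach, assuming pd.qcut is acceptable for the batching; the names d_sorted, batch, batch_means and lowest_mean are illustrative and not from the original post:

```python
import pandas as pd

d = pd.DataFrame({
    "name": ["Tendulkar", "Sangakkara", "Ponting", "Jayasurya", "Jayawardene",
             "Kohli", "Haq", "Kallis", "Ganguly", "Dravid"],
    "runs": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# Sort by runs, then label each row with its quantile batch:
# 0 is the lowest 20%, 4 is the highest 20%.
d_sorted = d.sort_values("runs").reset_index(drop=True)
d_sorted["batch"] = pd.qcut(d_sorted["runs"], q=5, labels=False)

# Average runs per batch, then pick the lowest batch.
batch_means = d_sorted.groupby("batch")["runs"].mean()
lowest_mean = batch_means.loc[0]
print(lowest_mean)
```

On this data the lowest batch holds the rows with runs 1 and 2, matching the loc-based answer. With heavily tied values pd.qcut can fail on duplicate bin edges, so the quantile-threshold version may be the safer choice in that case.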