Search code examples
pandasmedian

Calculating the sum of rows in a column when other columns satisfy a criterion


I've got a dataframe that looks like this (code is at the bottom, if you care):

                A    B   C   D       E
Year Category                          
1980 Alpha     100  800  10  15  1.2350
     Bravo     200  700  20  25  2.3550
     Charlie   300  600  30  35  3.3600
     Delta     400  500  40  45  4.4448
1981 Alpha     500  400  50  55  5.5000
     Bravo     600  300  60  65  6.0000
     Charlie   700  200  70  75  7.2000
     Delta     800  100  80  85  8.0000

For each year, I'm trying to get the sum (and average) of columns A-D for all the rows where the value in E is larger than the median for that year.

So, for 1980 only Charlie & Delta qualify as having a value in column E that is greater than the median for that year. Hence the result for column A should be ...sum = 300 + 400 = 700 ... average = mean(300, 400) = 350

The closest I can get here is using groupby / median

df.groupby(['Year']).median()

...but then I'm not quite sure what to do with the outputs. This gives me the medians for each year but I can't then easily link back these results to my initial dataframe.

Dataframe code:

data = {'A': [100,200,300,400,500,600,700,800], 
        'B': [800,700,600,500,400,300,200,100], 
        'C': [10,20,30,40,50,60,70,80], 
        'D': [15,25,35,45,55,65,75,85], 
        'E': [1.235,2.355,3.36,4.4448,5.5,6,7.2,8]}

idx = pd.MultiIndex.from_product([[1980, 1981],['Alpha','Bravo', 'Charlie','Delta']], names=['Year','Category'])
df = pd.DataFrame(data, index=idx, columns=['A', 'B', 'C', 'D', 'E'])

Solution

  • IIUC, you can do a groupby on the first level and transform median on column E , then compare and drop off rows where E exceeds the median and sum on level=0:

    s = df['E'].gt(df.groupby(level=0)['E'].transform('median'))
    df.drop('E',1)[s].sum(level=0)
    

            A     B    C    D
    Year                      
    1980   700  1100   70   80
    1981  1500   300  150  160
    

    For having mean and sum stacked side by side you can do the same with concat and add_suffix to identify:

    s = df['E'].gt(df.groupby(level=0)['E'].transform('median'))
    m = df.drop('E',1)[s]
    pd.concat((m.sum(level=0).add_suffix('_sum'),m.mean(level=0).add_suffix('_avg')),axis=1)
    

          A_sum  B_sum  C_sum  D_sum  A_avg  B_avg  C_avg  D_avg
    Year                                                        
    1980    700   1100     70     80    350    550     35     40
    1981   1500    300    150    160    750    150     75     80