Search code examples
pythonpandasdataframe

Pandas Dataframe Multiindex - Calculate Mean and add additional column to each level of the index


Given the following dataframe:

Year     2024          2023          2022
Header   N Result SD   N Result SD   N Result SD
Vendor
A        5 20     3    5 22     4    1 21     3
B        4 25     2    4 25     3    4 26     5
C        9 22     3    9 27     1    3 23     3   
D        3 23     5    3 16     2    5 13     4 
E        5 27     2    5 21     3    3 19     5

I would like to calculate for each year the mean value of the results column and then create a column, where the relative deviation to the mean is displayed (e.g. Results Value / mean-value * 100). The N and SD column were just included for completeness and is not needed for the calculation.

Year     2024                    2023                      2022
Header   N Result SD Deviation   N Result SD Deviation     N Result SD Deviation
Vendor
A        5 20     3  85.5        5 22     4  99.1          1 21     3  ..
B        4 25     2  106         4 25     3  113           4 26     5  ..
C        9 22     3  ..          9 27     1  ..            3 23     3  .. 
D        3 23     5  ..          3 16     2  ..            5 13     4  ..
E        5 27     2  ..          5 21     3  ..            3 19     5  ..

How what i be able to achieve that?

Thanks a lot in advance!


Solution

  • Use DataFrame.xs for select Result labels in MultiIndex, divide by mean and append to original in concat, last for correct position add DataFrame.sort_index with parameter sort_remaining=False:

    df1 = df.xs('Result', axis=1, level=1, drop_level=False)
    
    out = (pd.concat([df, 
                     df1.div(df1.mean()).mul(100)
                        .rename(columns={'Result':'Deviation'})], axis=1)
             .sort_index(axis=1, ascending=False, level=0, sort_remaining=False))
    print (out)
      2024                       2023                       2022            \
         N Result SD   Deviation    N Result SD   Deviation    N Result SD   
    A    5     20  3   85.470085    5     22  4   99.099099    1     21  3   
    B    4     25  2  106.837607    4     25  3  112.612613    4     26  5   
    C    9     22  3   94.017094    9     27  1  121.621622    3     23  3   
    D    3     23  5   98.290598    3     16  2   72.072072    5     13  4   
    E    5     27  2  115.384615    5     21  3   94.594595    3     19  5   
    
                   
        Deviation  
    A  102.941176  
    B  127.450980  
    C  112.745098  
    D   63.725490  
    E   93.137255  
    

    Another loop idea:

    for x in df.columns.levels[0]:
        df[(x, 'Deviation')] = df[(x, 'Result')].div(df[(x, 'Result')].mean()).mul(100)
    
    out = df.sort_index(axis=1, ascending=False, level=0, sort_remaining=False)
    print (out)
      2024                       2023                       2022            \
         N Result SD   Deviation    N Result SD   Deviation    N Result SD   
    A    5     20  3   85.470085    5     22  4   99.099099    1     21  3   
    B    4     25  2  106.837607    4     25  3  112.612613    4     26  5   
    C    9     22  3   94.017094    9     27  1  121.621622    3     23  3   
    D    3     23  5   98.290598    3     16  2   72.072072    5     13  4   
    E    5     27  2  115.384615    5     21  3   94.594595    3     19  5   
    
                   
        Deviation  
    A  102.941176  
    B  127.450980  
    C  112.745098  
    D   63.725490  
    E   93.137255