Search code examples
pythonpandasdataframegroup-by

How to get sum at each column in multi column groups for pandas dataframe


Below is a dataframe.

  team task day  in  out
0    A    X   M  60   11
1    A    X   T  80   12
2    A    Y   M  91   31
3    A    Y   T  11   21
4    B    X   M  32   13
5    B    X   T  18   33
6    B    Z   M  28   42
7    B    Z   T  15   15

When groupby is performed with multiple columns, the output is as below.

df.groupby(['team', 'task','day']).sum()

               in  out
team task day         
A    X    M    60   11
          T    80   12
     Y    M    91   31
          T    11   21
B    X    M    32   13
          T    18   33
     Z    M    28   42
          T    15   15

In the same output, I am looking to get the sum for each column level i.e. at team, task level as well. As shown below.

               in  out
team task day         
A              242   75   <<<<<
A    X         140   23   <<<<<  
A    X    M    60   11
          T    80   12
A    Y         102   52   <<<<<
     Y    M    91   31
          T    11   21
B              93   103   <<<<<
B    X         50   46    <<<<<
B    X    M    32   13
          T    18   33
B    Z         43   57    <<<<
     Z    M    28   42
          T    15   15

I get the sum for these columns when performed the grouping separately.

df.groupby(['team', 'task','day']).sum() df.groupby(['team', 'task']).sum() df.groupby(['team']).sum()

team task day         
A    X    M    60   11
          T    80   12
     Y    M    91   31
          T    11   21
B    X    M    32   13
          T    18   33
     Z    M    28   42
          T    15   15
            in  out
team task          
A    X     140   23
     Y     102   52
B    X      50   46
     Z      43   57
       in  out
team          
A     242   75
B      93  103

I am looking for the combined output of sum done at each level as explained above.


Solution

  • This is a possible solution:

    a = df.groupby(['team', 'task','day']).sum()
    b = df.groupby(['team', 'task']).sum()
    c = df.groupby(['team']).sum()
    
    b['day'] = None
    c[['task', 'day']] = None
    
    a = a.reset_index()
    b = b.reset_index()
    c = c.reset_index()
    
    result = pd.concat(
        [a, b, c],
    ).set_index(
        ['team', 'task', 'day'],
    ).sort_index(
        na_position="first",
    )
    

    This is the output:

    >>> result
                    in  out
    team task day          
    A    NaN  NaN  242   75
         X    NaN  140   23
              M     60   11
              T     80   12
         Y    NaN  102   52
              M     91   31
              T     11   21
    B    NaN  NaN   93  103
         X    NaN   50   46
              M     32   13
              T     18   33
         Z    NaN   43   57
              M     28   42
              T     15   15