Search code examples
pythonpandasdataframepandas-groupbymulti-index

Pandas groupby with multiindex columns


My goal, the way I expected to achieve it, and what happens instead

I am trying to do a groupby on a DataFrame which has multiindex columns using a Series (without multiindex) as an input to group by. Specifically, given the below DataFrame

>>> df
            X        Y      
            A  B  C  A  B  C
2020-01-01  9  1  2  1  6  5
2020-01-02  5  7  8  0  6  9
2020-01-03  6  3  4  8  6  1
2020-01-06  0  0  9  0  5  1
2020-01-07  8  7  4  8  3  1

and the Series representing the groups

>>> groups
A    D
B    D
C    E
dtype: object

I try to run the following

>>> df.groupby(groups, axis=1, level=1).sum()

and expect to get

             X      Y   
             D  E   D  E
2020-01-01  10  2   7  5
2020-01-02  12  8   6  9
2020-01-03   9  4  14  1
2020-01-06   0  9   5  1
2020-01-07  15  4  11  1

Instead however I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/frame.py", line 6717, in groupby
    return DataFrameGroupBy(
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 560, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 828, in get_grouper
    Grouping(
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 485, in __init__
    ) = index._get_grouper_for_level(self.grouper, level)
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/indexes/multi.py", line 1487, in _get_grouper_for_level
    grouper = level_values.map(mapper)
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5098, in map
    new_values = super()._map_values(mapper, na_action=na_action)
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/base.py", line 937, in _map_values
    new_values = map_f(values, mapper)
  File "pandas/_libs/lib.pyx", line 2467, in pandas._libs.lib.map_infer
TypeError: 'numpy.ndarray' object is not callable

I'm using Python 3.8.8 and Pandas version 1.2.3.

A sub-optimal solution

One way I found to achieve the above is with the following code, but I'm specifically wondering whether there is a cleaner way to do so. If not, why not? To me the above attempt would be the expected behaviour of the groupby method, but it seems I'm misunderstanding the logic behind it.

>>> df, groups = df.align(groups, axis=1, level=1)
>>> df.groupby(groups, axis=1).apply(lambda x: x.sum(axis=1, level=0)).swaplevel(axis=1).sort_index(axis=1)
             X      Y   
             D  E   D  E
2020-01-01  10  2   7  5
2020-01-02  12  8   6  9
2020-01-03   9  4  14  1
2020-01-06   0  9   5  1
2020-01-07  15  4  11  1

Solution

  • You can use rename by second level of MultiIndex and then aggregate by both levels:

    df = df.rename(columns=groups, level=1).sum(axis=1, level=[0,1])
    
    #working like
    #df = df.rename(columns=groups, level=1).groupby(axis=1, level=[0,1]).sum()
    print (df)
                 X      Y   
                 D  E   D  E
    2020-01-01  10  2   7  5
    2020-01-02  12  8   6  9
    2020-01-03   9  4  14  1
    2020-01-06   0  9   5  1
    2020-01-07  15  4  11  1
    

    Your solution should be changed by lambda function, but output is different:

    df = df.groupby(lambda x: groups[x], axis=1, level=1).sum()
    print (df)
                 D   E
    2020-01-01  17   7
    2020-01-02  18  17
    2020-01-03  23   5
    2020-01-06   5  10
    2020-01-07  26   5