Search code examples
pythonpandasdataframepivot-tablemulti-index

pandas dataframe groupby conditional count on multi-level column


Let's say we have dataframe like this

np.random.seed(123)
df = pd.DataFrame(np.random.randint(100,size=(4, 4)),columns = pd.MultiIndex.from_product([['exp0','exp1'],['rnd0','rnd1']],names=['experiments','rnd_runs']))

df['grp1','cat'] = ['A','A','B','B']
df['grp2','cat2'] = ['C','C','C','B']

experiments exp0            exp1            grp1 grp2
rnd_runs    rnd0    rnd1    rnd0    rnd1    cat cat2
0             66    92      98      17      A   C
1             83    57      86      97      A   C
2             96    47      73      32      B   C
3             46    96      25      83      B   B

I want to count values in ('exp0', 'rdn0') column with groupby ('grp1','cat')

So I tried;

df['exp0_cnt','rdn0'] = df.groupby([('grp1','cat')])[('exp0', 'rdn0')].apply(sum(x > 50 for x in df[(('exp0', 'rdn0'))]))

but getting the error

TypeError: other must be a MultiIndex or a list of tuples

Here are the similar posts and I think I'm doing multi-level column callings with tuples.

conditional on multi header pandas dataframe

pandas dataframe groupby on multiindex

Better way for creating columns in a multi level columns pandas dataframe


Solution

  • The only ways to select MultiIndex columns from a groupby is with a list of tuples or a MultiIndex (as indicated by the Error Message):

    So, instead of [('exp0', 'rdn0')] it needs to be [[('exp0', 'rnd0')]].

    df['exp0_cnt', 'rdn0'] = (
        df.groupby([('grp1', 'cat')])[[('exp0', 'rnd0')]]
            #                         ^ needs to be a list of tuples
            .transform(lambda x: x.gt(50).sum())  # Some function that works
    )
    

    *I've also changed the apply function, because it seems to be missing the lambda so I made a guess as to an equivalent:

    .apply(sum(x > 50 for x in df[(('exp0', 'rdn0'))])
    

    To transform since it's being assigned back to the DataFrame:

    .transform(lambda x: x.gt(50).sum())
    

    df:

    experiments exp0      exp1      grp1 grp2 exp0_cnt
    rnd_runs    rnd0 rnd1 rnd0 rnd1  cat cat2     rdn0
    0             66   92   98   17    A    C        2
    1             83   57   86   97    A    C        2  # 2 values over 50 (in group)
    2             96   47   73   32    B    C        1
    3             46   96   25   83    B    B        1  # 1 values over 50 (in group)
    

    Please Note: This means that a SeriesGroupBy cannot be created by selecting MultiIndex columns, only DataFrameGroupBy operations.

    type(df.groupby([('grp1', 'cat')])[[('exp0', 'rnd0')]])
    # <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
    

    This will exclude a few operations like SeriesGroupBy.unique

    df.groupby([('grp1', 'cat')])[[('exp0', 'rnd0')]].unique()
    
    AttributeError: 'DataFrameGroupBy' object has no attribute 'unique'
    

    However, we can force a SeriesGroupBy by Selecting the Series from the DataFrame and grouping by the Series values directly:

    df[('exp0', 'rnd0')].groupby(df[('grp1', 'cat')]).unique()
    #  ^ select specific column  ^ pass the Series to groupby directly
    
    (grp1, cat)
    A    [66, 83]
    B    [96, 46]
    Name: (exp0, rnd0), dtype: object