Let's say we have dataframe like this
np.random.seed(123)
df = pd.DataFrame(np.random.randint(100,size=(4, 4)),columns = pd.MultiIndex.from_product([['exp0','exp1'],['rnd0','rnd1']],names=['experiments','rnd_runs']))
df['grp1','cat'] = ['A','A','B','B']
df['grp2','cat2'] = ['C','C','C','B']
experiments exp0 exp1 grp1 grp2
rnd_runs rnd0 rnd1 rnd0 rnd1 cat cat2
0 66 92 98 17 A C
1 83 57 86 97 A C
2 96 47 73 32 B C
3 46 96 25 83 B B
I want to count
values in ('exp0', 'rdn0')
column with groupby
('grp1','cat')
So I tried;
df['exp0_cnt','rdn0'] = df.groupby([('grp1','cat')])[('exp0', 'rdn0')].apply(sum(x > 50 for x in df[(('exp0', 'rdn0'))]))
but getting the error
TypeError: other must be a MultiIndex or a list of tuples
Here are the similar posts and I think I'm doing multi-level column callings with tuples
.
conditional on multi header pandas dataframe
pandas dataframe groupby on multiindex
Better way for creating columns in a multi level columns pandas dataframe
The only ways to select MultiIndex columns from a groupby is with a list of tuples or a MultiIndex (as indicated by the Error Message):
So, instead of [('exp0', 'rdn0')]
it needs to be [[('exp0', 'rnd0')]]
.
df['exp0_cnt', 'rdn0'] = (
df.groupby([('grp1', 'cat')])[[('exp0', 'rnd0')]]
# ^ needs to be a list of tuples
.transform(lambda x: x.gt(50).sum()) # Some function that works
)
*I've also changed the apply function, because it seems to be missing the lambda
so I made a guess as to an equivalent:
.apply(sum(x > 50 for x in df[(('exp0', 'rdn0'))])
To transform
since it's being assigned back to the DataFrame:
.transform(lambda x: x.gt(50).sum())
df
:
experiments exp0 exp1 grp1 grp2 exp0_cnt
rnd_runs rnd0 rnd1 rnd0 rnd1 cat cat2 rdn0
0 66 92 98 17 A C 2
1 83 57 86 97 A C 2 # 2 values over 50 (in group)
2 96 47 73 32 B C 1
3 46 96 25 83 B B 1 # 1 values over 50 (in group)
Please Note: This means that a SeriesGroupBy
cannot be created by selecting MultiIndex columns, only DataFrameGroupBy
operations.
type(df.groupby([('grp1', 'cat')])[[('exp0', 'rnd0')]])
# <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
This will exclude a few operations like SeriesGroupBy.unique
df.groupby([('grp1', 'cat')])[[('exp0', 'rnd0')]].unique()
AttributeError: 'DataFrameGroupBy' object has no attribute 'unique'
However, we can force a SeriesGroupBy
by Selecting the Series from the DataFrame and grouping by the Series values directly:
df[('exp0', 'rnd0')].groupby(df[('grp1', 'cat')]).unique()
# ^ select specific column ^ pass the Series to groupby directly
(grp1, cat)
A [66, 83]
B [96, 46]
Name: (exp0, rnd0), dtype: object