How can I drop data from one level of a multi-level indexed data frame, based on aggregated information I get from a column within a groupby on that level?
For example, with data frame dfmi
:
midx = pd.MultiIndex.from_product([['A0','A1','A2'], ['B0','B1','B2']], names=["index_1", "index_2"])
columns = ['foo', 'bar']
dfmi = pd.DataFrame(np.arange(18).reshape((len(midx), len(columns))),
index=midx, columns=columns)
dfmi
foo bar
index_1 index_2
A0 B0 0 1
B1 2 3
B2 4 5
A1 B0 6 7
B1 8 9
B2 10 11
A2 B0 12 13
B1 14 15
B2 16 17
Let's say I only want to keep levels of index_1
if the mean for foo
exceeds a certain threshold.
Like this:
thresh = 5
for grp, data in dfmi.groupby("index_1"):
print(data.foo.mean() > thresh)
False <-- drop this level
True
True
Desired output:
foo bar
index_1 index_2
A1 B0 6 7
B1 8 9
B2 10 11
A2 B0 12 13
B1 14 15
B2 16 17
In this toy example I can get what I want with dfmi.loc[pd.IndexSlice["A1":"A2", :]]
. But I can't figure out how to use IndexSlice
or loc
variants to do aggregations inside a grouped MultiIndex and then slice the full data frame based on the results.
My best solution so far is to just keep track of the level values that qualify as keepers (with grp
), and then use the accumulated keepers
collection with IndexSlice
:
keepers = list()
for grp, data in dfmi.groupby("index_1"):
if data.foo.mean() > thresh:
keepers.append(grp)
dfmi.loc[pd.IndexSlice[keepers, :]]
I'm looking for a more efficient and/or more elegant way to accomplish that with native Pandas functionality.
You can use loc
once you have created your mask, like so:
mask = dfmi.groupby(level=0)['foo'].mean()>thresh
dfmi.loc[mask.index[mask]]
Yields:
index_1 index_2
A1 B0 6 7
B1 8 9
B2 10 11
A2 B0 12 13
B1 14 15
B2 16 17