Search code examples
pythonpandaspandas-groupbymulti-index

Filter grouped Pandas data frame by column aggregate, when groups are from a MultiIndex level


How can I drop data from one level of a multi-level indexed data frame, based on aggregated information I get from a column within a groupby on that level?

For example, with data frame dfmi:

midx = pd.MultiIndex.from_product([['A0','A1','A2'], ['B0','B1','B2']], names=["index_1", "index_2"])
columns = ['foo', 'bar']
dfmi = pd.DataFrame(np.arange(18).reshape((len(midx), len(columns))),
                    index=midx, columns=columns)

dfmi
                 foo  bar
index_1 index_2          
A0      B0         0    1
        B1         2    3
        B2         4    5
A1      B0         6    7
        B1         8    9
        B2        10   11
A2      B0        12   13
        B1        14   15
        B2        16   17

Let's say I only want to keep levels of index_1 if the mean for foo exceeds a certain threshold.
Like this:

thresh = 5

for grp, data in dfmi.groupby("index_1"):
    print(data.foo.mean() > thresh)

False  <-- drop this level
True
True

Desired output:

                 foo  bar
index_1 index_2          
A1      B0         6    7
        B1         8    9
        B2        10   11
A2      B0        12   13
        B1        14   15
        B2        16   17

In this toy example I can get what I want with dfmi.loc[pd.IndexSlice["A1":"A2", :]]. But I can't figure out how to use IndexSlice or loc variants to do aggregations inside a grouped MultiIndex and then slice the full data frame based on the results.

My best solution so far is to just keep track of the level values that qualify as keepers (with grp), and then use the accumulated keepers collection with IndexSlice:

keepers = list()

for grp, data in dfmi.groupby("index_1"):
    if data.foo.mean() > thresh:
        keepers.append(grp)

dfmi.loc[pd.IndexSlice[keepers, :]]

I'm looking for a more efficient and/or more elegant way to accomplish that with native Pandas functionality.


Solution

  • You can use loc once you have created your mask, like so:

    mask = dfmi.groupby(level=0)['foo'].mean()>thresh
    
    dfmi.loc[mask.index[mask]]
    

    Yields:

    index_1 index_2          
    A1      B0         6    7
            B1         8    9
            B2        10   11
    A2      B0        12   13
            B1        14   15
            B2        16   17