Search code examples
pythonpandaspandas-groupbymulti-index

Remove index from MultiIndex dataframe if child index has column value meeting criteria


I had originally asked this question here, and I believe it was incorrectly marked as a duplicate. I will do my best here to clarify my question and how I believe it is unique.

Given the following example MultiIndex dataframe:

import pandas as pd
import numpy as np

first = ['A', 'B', 'C']
second = ['a', 'b', 'c', 'd']
third = ['1', '2', '3']

indices = [first, second, third]

index = pd.MultiIndex.from_product(indices, names=['first', 'second', 'third'])

df = pd.DataFrame(np.random.randint(10, size=(len(first)*len(second)*len(third), 4)), index=index, columns=['Val1','Val2',' Val3', 'Val4'])

Goal: I would like to retain a specific level=1 index (such as 'a') if the value of column 'Val2' corresponding to index value 1 in level=2 is greater than 5 for that level=1 index. Therefore, if this criteria is not met (i.e. column 'Val2' is less than or equal to 5 for index 1 in level=2), then the corresponding level=1 index would be removed from the dataframe. If all level=1 indices do not meet the criteria for a given level=0 index, then that level=0 index would also be removed. My previous post contains my expected output (I can add it here, but I wanted this post to be as succinct as possible for clarity).

Here is my current solution, the performance of which I'm sure can be improved:

grouped = df.groupby(level=0)

output = pd.concat([grouped.get_group(key).groupby(level=1).filter(lambda x: (x.loc[pd.IndexSlice[:, :, '1'], 'Val2']>5).any()) for key, group in grouped])

This does produce my desired output, but for a dataframe with 100,000's of rows, the performance is rather poor. Is there something obvious I am missing here to better utilize the under-the-hood optimization of pandas?


Solution

  • I got the same result as your example solution by doing the following:

    df.loc[df.xs('1', level=2)['Val2'] > 5]
    

    Comparing time performance this is ~15X faster (in my machine your example takes 36ms while this take 2ms).