Search code examples
pythonpandasdataframemulti-index

How to query a MultiIndex by MultiIndex and choose the "best" row?


Say I have a MultiIndex by MultiIndex DataFrame similar to the one generated here (in the real use case the list of races is dynamic and not known ahead of time):

import random

import pandas as pd

random.seed(1)

data_frame_rows = pd.MultiIndex.from_arrays([[], [], []], names=("car", "engine", "wheels"))
data_frame_columns = pd.MultiIndex.from_arrays([[], [], []], names=("group", "subgroup", "details"))
data_frame = pd.DataFrame(index=data_frame_rows, columns=data_frame_columns)

for car in ("mustang", "corvette", "civic"):
    for engine in ("normal", "supercharged"):
        for wheels in ("normal", "wide"):
            data_frame.loc[(car, engine, wheels), ("cost", "", "money ($)")] = int(random.random() * 100)
            data_frame.loc[(car, engine, wheels), ("cost", "", "maintenance (minutes)")] = int(random.random() * 60)

            for race in ("f1", "indy", "lemans"):
                percent_win = random.random()
                recommended = percent_win >= 0.8
                data_frame.loc[(car, engine, wheels), ("race", race, "win %")] = percent_win
                data_frame.loc[(car, engine, wheels), ("race", race, "recommended")] = recommended

Which then will look something like:

group                             cost                            race                                                        
subgroup                                                            f1                  indy                lemans            
details                      money ($) maintenance (minutes)     win % recommended     win % recommended     win % recommended
car      engine       wheels                                                                                                  
mustang  normal       normal      13.0                  50.0  0.763775       False  0.255069       False  0.495435       False
                      wide        44.0                  39.0  0.788723       False  0.093860       False  0.028347       False
         supercharged normal      83.0                  25.0  0.762280       False  0.002106       False  0.445387       False
                      wide        72.0                  13.0  0.945271        True  0.901427        True  0.030590       False
corvette normal       normal       2.0                  32.0  0.939149        True  0.381204       False  0.216599       False
                      wide        42.0                   1.0  0.221692       False  0.437888       False  0.495812       False
         supercharged normal      23.0                  13.0  0.218781       False  0.459603       False  0.289782       False
                      wide         2.0                  50.0  0.556454       False  0.642294       False  0.185906       False
civic    normal       normal      99.0                  51.0  0.120890       False  0.332695       False  0.721484       False
                      wide        71.0                  56.0  0.422107       False  0.830036        True  0.670306       False
         supercharged normal      30.0                  35.0  0.882479        True  0.846197        True  0.505284       False
                      wide        58.0                   2.0  0.242740       False  0.797404       False  0.414314       False

I now want to find all rows where the particular car configuration (engine and wheel combination) is the "best" configuration for that car. For example, in this case the civic has two configurations that are recommended, but the civic with a supercharged engine and normal wheels has the highest chance of winning a race (88% in the f1 race). All other cars/configurations that aren't recommended for any race, or have a lower chance of winning any race, I want filtered out. The mustang and the corvette have one configuration each which is recommended for any race, so these are the configurations I would choose for those two cars.

So the final output would be each car listed at most once, with the best configuration. If a car has no recommended configurations then I want it out completely.

I've read this multiple times and for the life of me I can't figure it out.

As a starting point to just get the recommended rows I tried something like:

data_frame[(data_frame.loc[:,idx["race",:,"recommended"]]==True)]

But that doesn't seem to filter the rows, just sets things to either NaN or True

group                             cost                        race                                                 
subgroup                                                        f1              indy             lemans            
details                      money ($) maintenance (minutes) win % recommended win % recommended  win % recommended
car      engine       wheels                                                                                       
mustang  normal       normal       NaN                   NaN   NaN         NaN   NaN         NaN    NaN         NaN
                      wide         NaN                   NaN   NaN         NaN   NaN         NaN    NaN         NaN
         supercharged normal       NaN                   NaN   NaN         NaN   NaN         NaN    NaN         NaN
                      wide         NaN                   NaN   NaN        True   NaN        True    NaN         NaN
corvette normal       normal       NaN                   NaN   NaN        True   NaN         NaN    NaN         NaN
                      wide         NaN                   NaN   NaN         NaN   NaN         NaN    NaN         NaN
         supercharged normal       NaN                   NaN   NaN         NaN   NaN         NaN    NaN         NaN
                      wide         NaN                   NaN   NaN         NaN   NaN         NaN    NaN         NaN
civic    normal       normal       NaN                   NaN   NaN         NaN   NaN         NaN    NaN         NaN
                      wide         NaN                   NaN   NaN         NaN   NaN        True    NaN         NaN
         supercharged normal       NaN                   NaN   NaN        True   NaN        True    NaN         NaN
                      wide         NaN                   NaN   NaN         NaN   NaN         NaN    NaN         NaN

Solution

  • Selecting multi-index columns can be complicated and has to be be precise. The code below gives the 4 rows with True in any recommended column as required:

    selection = data_frame[data_frame.loc[:, ('race', (slice(None)), 'recommended')].any(axis = 1)]
    
    print(selection) 
    

    Essentially this is selecting only columns with recommended then combining the True values from these into a mask using any() and finally using this to select the needed rows from the full DF. == True is not needed as assumed but would be accepted. (slice(None)) is a convenient way to select all columns at Level1 within race at Level 0; alternatively you could replace this with (['f1', 'indy', 'lemans']) but can't use : in this position. The output is as expected:

    enter image description here

    From here you can further select as required.

    The confusion with all the Nan values in your almost-there code was essentially due to not combining several columns to produce a single mask column.