Search code examples
pythonpandasdataframemulti-index

How to combine boolean indexer with multi-index in pandas?


I have a multi-indexed dataframe and I wish to extract a subset based on index values and on a boolean criteria. I wish to overwrite the values of a specific new values using multi-index keys and boolean indexers to select the records to modify.

import pandas as pd 
import numpy as np

years        = [1994,1995,1996]
householdIDs = [ id for id in range(1,100) ]

midx = pd.MultiIndex.from_product( [years, householdIDs], names = ['Year', 'HouseholdID'] )

householdIncomes = np.random.randint( 10000,100000, size = len(years)*len(householdIDs) )
householdSize    = np.random.randint( 1,5, size = len(years)*len(householdIDs) )
df = pd.DataFrame( {'HouseholdIncome':householdIncomes, 'HouseholdSize':householdSize}, index = midx ) 
df.sort_index(inplace = True)

Here's what the sample data looks like...

  df.head()
=>                   HouseholdIncome  HouseholdSize
Year HouseholdID                                
1994 1                      23866              3
     2                      57956              3
     3                      21644              3
     4                      71912              4
     5                      83663              3

I'm able to successfully query the dataframe using the indices and column labels.

This example gives me the HouseholdSize for household 3 in year 1996

   df.loc[  (1996,3 ) , 'HouseholdSize' ]
=> 1

However, I'm unable to combine boolean selection with multi-index queries...

The pandas docs on Multi-indexing says there is a way to combine boolean indexing with multi-indexing and gives an example...

In [52]: idx = pd.IndexSlice
In [56]: mask = dfmi[('a','foo')]>200

In [57]: dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
Out[57]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

...which I can't seem to replicate on my dataframe

    idx = pd.IndexSlice
    housholdSizeAbove2 = ( df.HouseholdSize > 2 )
    df.loc[ idx[ housholdSizeAbove2, 1996, :] , 'HouseholdSize' ] 
Traceback (most recent call last):
  File "python", line 1, in <module>
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (3), lexsort depth (2)'

In this example I would want to see all the households in 1996 with householdsize above 2


Solution

  • Pandas.query() should work in this case:

    df.query("Year == 1996 and HouseholdID > 2")
    

    Demo:

    In [326]: with pd.option_context('display.max_rows',20):
         ...:     print(df.query("Year == 1996 and HouseholdID > 2"))
         ...:
                      HouseholdIncome  HouseholdSize
    Year HouseholdID
    1996 3                      28664              4
         4                      11057              1
         5                      36321              2
         6                      89469              4
         7                      35711              2
         8                      85741              1
         9                      34758              3
         10                     56085              2
         11                     32275              4
         12                     77096              4
    ...                           ...            ...
         90                     40276              4
         91                     10594              2
         92                     61080              4
         93                     65334              2
         94                     21477              4
         95                     83112              4
         96                     25627              2
         97                     24830              4
         98                     85693              1
         99                     84653              4
    
    [97 rows x 2 columns]
    

    UPDATE:

    Is there a way to select a specific column?

    In [333]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdIncome']
    Out[333]:
    Year  HouseholdID
    1996  3              28664
          4              11057
          5              36321
          6              89469
          7              35711
          8              85741
          9              34758
          10             56085
          11             32275
          12             77096
                         ...
          90             40276
          91             10594
          92             61080
          93             65334
          94             21477
          95             83112
          96             25627
          97             24830
          98             85693
          99             84653
    Name: HouseholdIncome, dtype: int32
    

    and ultimately I want to overwrite the data on the dataframe.

    In [331]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdSize'] *= 10
    
    In [332]: df.loc[df.eval("Year == 1996 and HouseholdID > 2")]
    Out[332]:
                      HouseholdIncome  HouseholdSize
    Year HouseholdID
    1996 3                      28664             40
         4                      11057             10
         5                      36321             20
         6                      89469             40
         7                      35711             20
         8                      85741             10
         9                      34758             30
         10                     56085             20
         11                     32275             40
         12                     77096             40
    ...                           ...            ...
         90                     40276             40
         91                     10594             20
         92                     61080             40
         93                     65334             20
         94                     21477             40
         95                     83112             40
         96                     25627             20
         97                     24830             40
         98                     85693             10
         99                     84653             40
    
    [97 rows x 2 columns]
    

    UPDATE2:

    I want to pass a variable year instead of a specific value. Is there a cleaner way to do it than Year == " + str(year) + " and HouseholdID > " + str(householdSize) ?

    In [5]: year = 1996
    
    In [6]: household_ids = [1, 2, 98, 99]
    
    In [7]: df.loc[df.eval("Year == @year and HouseholdID in @household_ids")]
    Out[7]:
                      HouseholdIncome  HouseholdSize
    Year HouseholdID
    1996 1                      42217              1
         2                      66009              3
         98                     33121              4
         99                     45489              3