Search code examples
pythonpandasdataframemulti-index

how to filter part of a multi-index level dataframe with different conditions?


Here is an original dataframe,

                        country2
year                        2017     2018     2019
country1                       1   2    1   2    3   6
data_provider indicator
prov_1        ind_a           45  30   22  30   30  30
prov_2        ind_a           30  30   30  30   25  30
              ind_b           30  32   30  30   30  30
prov_3        ind_b           30  30   30  35   30  28

and I wish to filter the column and finally get a new dataframe,

# item                    country2
# year                        2017        2018     2019
# country1                       1     2     1   2    3   6
# data_provider indicator
# prov_1        ind_a         45.0   NaN  22.0 NaN  NaN NaN
# prov_2        ind_a          NaN   NaN   NaN NaN  NaN NaN
#               ind_b          NaN  32.0   NaN NaN  NaN NaN
# prov_3        ind_b          NaN   NaN   NaN NaN  NaN NaN

you can get the original dataframe by,

df = pd.DataFrame(
    data={"data_provider": ["prov_1", "prov_1", "prov_2", "prov_2", "prov_3", "prov_3"],
          "indicator": ["ind_a", "ind_a", "ind_a", "ind_b", "ind_b", "ind_b"],
          "unit": ["EUR", "EUR", "EUR", "EUR", "EUR", "EUR"],
          "year": ["2017", "2018","2019", "2017","2018","2019"],
          "country1": [1, 1, 3, 2, 2, 6],
          "country2": [45, 22, 25, 32, 35, 28]
          }
)

df = df.pivot_table(
    index=['data_provider', 'indicator'],
    columns=['year', 'country1'],
    fill_value=30
  )
df.columns.names = ['item', 'year', 'country1']

here is how I get the new dataframe,

  1. locate the 2 group of target column labels
    x1 = df.columns[df.columns.get_level_values(level='year')=='2017']
    x2 = df.columns[df.columns.get_level_values(level='year')=='2018']

  2. get newdf1 with condition1 df[x1]>30 newdf1 = df[df[x1] > 30]

  3. get newdf2 with condition2 df[x2]<30 newdf2 = df[df[x2] < 30]

  4. update newdf2 with newdf1

newdf = newdf2.combine_first(newdf1)

in my solution, I first get 2 dataframe after filtering the original dataframe with different conditions and then combine them together. I am wondering if there's a one direct way to achieve the goal.


Solution

  • import pandas as pd
    
    df = pd.DataFrame(
        data={
            "data_provider": ["prov_1", "prov_1", "prov_2", "prov_2", "prov_3", "prov_3"],
            "indicator": ["ind_a", "ind_a", "ind_a", "ind_b", "ind_b", "ind_b"],
            "unit": ["EUR", "EUR", "EUR", "EUR", "EUR", "EUR"],
            "year": ["2017", "2018","2019", "2017","2018","2019"],
            "country1": [1, 1, 3, 2, 2, 6],
            "country2": [45, 22, 25, 32, 35, 28]
        }
    )
    
    df = df.pivot_table(
        index=['data_provider', 'indicator'],
        columns=['year', 'country1'],
        fill_value=30
    )
    df.columns.names = ['item', 'year', 'country1']
    
    
    cond1 = (df.columns.get_level_values(level='year') == '2017') & df.gt(30)
    cond2 = (df.columns.get_level_values(level='year') == '2018') & df.lt(30)
        
    df_new = df.where(cond1 | cond2)
    print(df_new)
    

    Output :

                         country2                  
    year                        2017       2018    2019
    country1                       1   2    1   2   3   6
    data_provider indicator                           
    prov_1        ind_a         45.0 NaN  22.0 NaN NaN NaN
    prov_2        ind_a          NaN NaN   NaN NaN NaN NaN
                  ind_b          NaN NaN   NaN NaN NaN NaN
    prov_3        ind_b          NaN NaN   NaN NaN NaN NaN