Search code examples
pandasdataframemulti-index

why can't I select inner level index elements in a dataframe?


I have this dataframe

Out[108]: 
                         workingGasVolume  gasInStorage
country gasDayStartedOn                                
AT      2017-01-01                94.6115       53.8217
        2017-02-01                94.6115       34.1189
        2017-03-01                94.6115       20.1395
        2017-04-01                91.7617       14.2003
        2017-05-01                91.7617       17.0906
                                  ...           ...
UA      2021-01-01               320.0524      196.7549
        2021-02-01               318.0987      168.2637
        2021-03-01               298.3932      129.6654
        2021-04-01               320.1722      115.9547
        2021-05-01               320.1562      108.9367

[954 rows x 2 columns]

with this index

df_storage_by_country_month.index
Out[109]: 
MultiIndex([('AT', '2017-01-01'),
            ('AT', '2017-02-01'),
            ('AT', '2017-03-01'),
            ('AT', '2017-04-01'),
            ('AT', '2017-05-01'),
            ('AT', '2017-06-01'),
            ('AT', '2017-07-01'),
            ('AT', '2017-08-01'),
            ('AT', '2017-09-01'),
            ('AT', '2017-10-01'),
            ...
            ('UA', '2020-08-01'),
            ('UA', '2020-09-01'),
            ('UA', '2020-10-01'),
            ('UA', '2020-11-01'),
            ('UA', '2020-12-01'),
            ('UA', '2021-01-01'),
            ('UA', '2021-02-01'),
            ('UA', '2021-03-01'),
            ('UA', '2021-04-01'),
            ('UA', '2021-05-01')],
           names=['country', 'gasDayStartedOn'], length=954)

when I followed the example in Python guideline with below code, my objective is to see all the data on that day from different countries

df_storage_by_country_month.loc[:,'2017-01-01']

I have

Traceback
KeyError: '2017-01-01'

why?


UPDATE 2021/06/01

I don't understand the logic and consistency of Pandas on this point for 2 reasons:

  1. shouldn't the inner and outer level of index be treated equally? it's weird that we can access the outer level index like this(example: df.loc['cobra'] from the official docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) but not the inner level,

  2. Why we can proceed this way for both inner and outer level index for a Series but NOT for a dataframe? Example:

    data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 3, 1, 2, 2, 3]])

    data.loc[:, 2]

    Out[15]: a 0.478943 c 0.092908 d 0.281746 dtype: float64


Solution

  • Because when you do .loc[:, 'something'] that something is always interpreted as column names.

    Try pd.IndexSlice:

    df.loc[pd.IndexSlice[:, '2017-01-01']]
    

    Or query:

    df.query('gasDayStartedOn == "2017-01-01"')