Search code examples
pandasmulti-index

Understanding pandas multi-index and double brackets


I am trying to understand multi-indexing. I have found some very good links (here by Jake VanderPlas and here by Nelson Minar) but I am not able to grasp the concept.

I do have some specific questions. Specifically, if we talk about this data -

import pandas as pd

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)

Then:

  1. Why/how does health_data.loc[:,'Guido'] removes the top column index whereas health_data.loc[:,['Guido']] preserves it.
  2. Why do health_data.loc[:, [('Bob', 'HR')]] and health_data.loc[:, ('Bob', 'HR')] work as intended (assuming answer to question 1 is clear) but health_data.loc[:, ['Bob', 'HR']] gives an extra column.
  3. If I define idx = pd.IndexSlice then why health_data.loc[:,[idx['Bob','HR']]] returns the intended output but health_data.loc[:,list(idx['Bob','HR'])] returns the extra column

As my questions arose from my lack of understanding of multi-index, any links which explains them in detail will be help as well. I have seen some of the SO questions and answers (this one helps a bit) but most of them are very specific and I could not find one that talks about in general about double bracket concept.


Solution

  • Q1: health_data.loc[:,'Guido'] selects a single column from the first level of the column multindex, whereas health_data.loc[:,['Guido']] selects a list of columns from the same. I suppose the pandas devs could have chosen to make the 1-element list a special case, and drop the redundant dimension from the response, but they didn't.

    In [6]: health_data.loc[:,'Guido']
    Out[6]: 
    type          HR  Temp
    year visit            
    2013 1      38.0  35.4
         2      29.0  36.9
    2014 1      41.0  36.6
         2      26.0  36.9
    
    In [7]: health_data.loc[:,['Guido']]
    Out[7]: 
    subject    Guido      
    type          HR  Temp
    year visit            
    2013 1      38.0  35.4
         2      29.0  36.9
    2014 1      41.0  36.6
         2      26.0  36.9
    

    Q2: cases 1 and 2 select 1 column, first as a 1-element list of columns, then as a single column value (like cases 2 and 1 respectively of your Q1). case 3 is more interesting: it selects a list of items from the first level of the multiindex. Since only 'Bob' exists in that levels values, it returns all columns which have 'Bob' in the first level

    In [29]: health_data.loc[:, [('Bob', 'HR')]]
    Out[29]: 
    subject      Bob
    type          HR
    year visit      
    2013 1      29.0
         2      47.0
    2014 1      56.0
         2      37.0
    
    In [30]: health_data.loc[:, ('Bob', 'HR')]
    Out[30]: 
    year  visit
    2013  1        29.0
          2        47.0
    2014  1        56.0
          2        37.0
    Name: (Bob, HR), dtype: float64
    
    In [31]: health_data.loc[:, ['Bob', 'HR']]
    Out[31]: 
    subject      Bob      
    type          HR  Temp
    year visit            
    2013 1      29.0  36.5
         2      47.0  37.7
    2014 1      56.0  38.5
         2      37.0  37.3
    

    Q3 this is essentially a rerun of Q2: idx['Bob','HR'] evaluates to ('Bob', 'HR') which selects a single column from the multiindex. list(idx['Bob','HR']) evaluates to ['Bob','HR'] which selects a list of columns from the 1st level of the multiindex.

    In [18]: health_data.loc[:,[idx['Bob','HR']]]
    Out[18]: 
    subject      Bob
    type          HR
    year visit      
    2013 1      29.0
         2      47.0
    2014 1      56.0
         2      37.0
    
    In [19]: health_data.loc[:,list(idx['Bob','HR'])]
    Out[19]: 
    subject      Bob      
    type          HR  Temp
    year visit            
    2013 1      29.0  36.5
         2      47.0  37.7
    2014 1      56.0  38.5
         2      37.0  37.3
    

    Since 'HR' does not exist in the 1st level of the multi index, the slice selects just 'Bob'. You can see the difference more clearly if you substitute 'Sue' in place of 'HR':

    In [27]: health_data.loc[:,[idx['Bob','Sue']]]
    Out[27]: 
    subject    Bob
    type       Sue
    year visit    
    2013 1     NaN
         2     NaN
    2014 1     NaN
         2     NaN
    
    In [28]: health_data.loc[:,list(idx['Bob','Sue'])]
    Out[28]: 
    subject      Bob         Sue      
    type          HR  Temp    HR  Temp
    year visit                        
    2013 1      29.0  36.5  35.0  36.4
         2      47.0  37.7  36.0  36.4
    2014 1      56.0  38.5  34.0  37.8
         2      37.0  37.3  30.0  36.9
    

    In the first case, we are selecting a column with the multiindex value of ('Bob', 'Sue') which doesn't exist - hence the NaNs. In the 2nd case we are selecting columns which contain all values from the list ['Bob', 'Sue'] in the first level of the multiindex. I must admit, it took me by surprise to see tuples and lists give different results here. Again I suppose this is an API decision...

    The docs cover most of this:

    https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-indexing-with-hierarchical-index

    https://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers