I am trying to understand multi-indexing. I have found some very good links (here by Jake VanderPlas and here by Nelson Minar) but I am not able to grasp the concept.
I do have some specific questions. Specifically, if we talk about this data -
import pandas as pd
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
Then:
health_data.loc[:,'Guido']
removes the top column
index whereas health_data.loc[:,['Guido']]
preserves it. health_data.loc[:, [('Bob', 'HR')]]
and health_data.loc[:,
('Bob', 'HR')]
work as intended (assuming answer to question 1 is
clear) but health_data.loc[:, ['Bob', 'HR']]
gives an extra
column.idx = pd.IndexSlice
then why health_data.loc[:,[idx['Bob','HR']]]
returns the intended output but health_data.loc[:,list(idx['Bob','HR'])]
returns the extra columnAs my questions arose from my lack of understanding of multi-index, any links which explains them in detail will be help as well. I have seen some of the SO questions and answers (this one helps a bit) but most of them are very specific and I could not find one that talks about in general about double bracket concept.
Q1: health_data.loc[:,'Guido']
selects a single column from the first level of the column multindex, whereas health_data.loc[:,['Guido']]
selects a list of columns from the same. I suppose the pandas devs could have chosen to make the 1-element list a special case, and drop the redundant dimension from the response, but they didn't.
In [6]: health_data.loc[:,'Guido']
Out[6]:
type HR Temp
year visit
2013 1 38.0 35.4
2 29.0 36.9
2014 1 41.0 36.6
2 26.0 36.9
In [7]: health_data.loc[:,['Guido']]
Out[7]:
subject Guido
type HR Temp
year visit
2013 1 38.0 35.4
2 29.0 36.9
2014 1 41.0 36.6
2 26.0 36.9
Q2: cases 1 and 2 select 1 column, first as a 1-element list of columns, then as a single column value (like cases 2 and 1 respectively of your Q1). case 3 is more interesting: it selects a list of items from the first level of the multiindex. Since only 'Bob' exists in that levels values, it returns all columns which have 'Bob' in the first level
In [29]: health_data.loc[:, [('Bob', 'HR')]]
Out[29]:
subject Bob
type HR
year visit
2013 1 29.0
2 47.0
2014 1 56.0
2 37.0
In [30]: health_data.loc[:, ('Bob', 'HR')]
Out[30]:
year visit
2013 1 29.0
2 47.0
2014 1 56.0
2 37.0
Name: (Bob, HR), dtype: float64
In [31]: health_data.loc[:, ['Bob', 'HR']]
Out[31]:
subject Bob
type HR Temp
year visit
2013 1 29.0 36.5
2 47.0 37.7
2014 1 56.0 38.5
2 37.0 37.3
Q3 this is essentially a rerun of Q2: idx['Bob','HR']
evaluates to ('Bob', 'HR')
which selects a single column from the multiindex. list(idx['Bob','HR'])
evaluates to ['Bob','HR']
which selects a list of columns from the 1st level of the multiindex.
In [18]: health_data.loc[:,[idx['Bob','HR']]]
Out[18]:
subject Bob
type HR
year visit
2013 1 29.0
2 47.0
2014 1 56.0
2 37.0
In [19]: health_data.loc[:,list(idx['Bob','HR'])]
Out[19]:
subject Bob
type HR Temp
year visit
2013 1 29.0 36.5
2 47.0 37.7
2014 1 56.0 38.5
2 37.0 37.3
Since 'HR' does not exist in the 1st level of the multi index, the slice selects just 'Bob'. You can see the difference more clearly if you substitute 'Sue' in place of 'HR':
In [27]: health_data.loc[:,[idx['Bob','Sue']]]
Out[27]:
subject Bob
type Sue
year visit
2013 1 NaN
2 NaN
2014 1 NaN
2 NaN
In [28]: health_data.loc[:,list(idx['Bob','Sue'])]
Out[28]:
subject Bob Sue
type HR Temp HR Temp
year visit
2013 1 29.0 36.5 35.0 36.4
2 47.0 37.7 36.0 36.4
2014 1 56.0 38.5 34.0 37.8
2 37.0 37.3 30.0 36.9
In the first case, we are selecting a column with the multiindex value of ('Bob', 'Sue')
which doesn't exist - hence the NaNs
. In the 2nd case we are selecting columns which contain all values from the list ['Bob', 'Sue']
in the first level of the multiindex. I must admit, it took me by surprise to see tuples and lists give different results here. Again I suppose this is an API decision...
The docs cover most of this:
https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-indexing-with-hierarchical-index
https://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers