MultiIndex dataframes are very powerful but personally I think there is no enough (clear) documentations on it, specially for different type of slicing... Here is my question:
How to slice a multi-indexed dataframe just on one level with a list of labels? Please help me if you have a solution (without reseting indexes and converting the dataframe to single level index! Which is obvious and not efficient)
For example, we have following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['id'] = pd.Series(range(10,20))
df['name'] = [f'name_{id}' for id in range(10,20)]
df['price'] = np.random.rand(df.index.size)
df['date'] = pd.date_range('20200310', '20200319')
df = df.set_index(['id', 'date'])
df
Slicing on one label is working just fine:
df.xs('2020-03-10', level='date', drop_level=False)
But how can we slice on a list of labels on that level?
df.xs(('2020-03-10', '2020-03-11', '2020-03-12'), level='date', drop_level=False)
This leads to an exception:
However Python doc says that "key" parameter could be a tuple as well:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.xs.html
For filter by multiple values use Index.get_level_values
with Index.isin
and boolean indexing
:
a = df[df.index.get_level_values('date').isin(('2020-03-10', '2020-03-11', '2020-03-12'))]
print (a)
name price
id date
10 2020-03-10 name_10 0.557772
11 2020-03-11 name_11 0.122315
12 2020-03-12 name_12 0.775976
However Python doc says that "key" parameter could be a tuple as well:
Tuple is possible use, but working differently - you can select by both labels like:
b = df.xs((10, '2020-03-10'), drop_level=False)
print (b)
name name_10
price 0.348808
Name: (10, 2020-03-10 00:00:00), dtype: object
c = df.xs((10, '2020-03-10'), level=('id','date'), drop_level=False)
print (c)
name price
id date
10 2020-03-10 name_10 0.239876
Like @yatu mentioned, another solution with IndexSlice
is with :
for all first levels and last :
for all columns:
df = df.loc[pd.IndexSlice[:, ['2020-03-10', '2020-03-11', '2020-03-12']], :]
print (df)
name price
id date
10 2020-03-10 name_10 0.557488
11 2020-03-11 name_11 0.592082
12 2020-03-12 name_12 0.547747