First, please forgive my bad vocabulary. Im still struggling with the correct terms, and have just discovered that I have created a multiindexed dataframe, which Im trying to learn how to manipulate.
The multiindex dataframe has 30 rows and 546 columns, and looks like a bigger version of this:
A | B | C | D | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
aa | bb | cc | aa | bb | cc | aa | bb | cc | aa | bb | cc | |
Date | ||||||||||||
2023-01-02 | 1 | 24 | 6 | 3 | 2 | 7 | 3 | 10 | 12 | 5 | 9 | 21 |
2023-01-03 | 1 | 23 | 7 | 3 | 4 | 6 | 3 | 9 | 13 | 6 | 10 | 22 |
2023-01-04 | 2 | 22 | 8 | 4 | 6 | 7 | 3 | 9 | 12 | 8 | 14 | 24 |
2023-01-05 | 3 | 21 | 10 | 3 | 8 | 6 | 4 | 8 | 11 | 10 | 12 | 21 |
The index is a timestamp date, and the top level (level 0?) column indexes A, B, C, D, etc each have the same 91 second level (level 1?) members: aa, bb, cc, etc
Since there are 546 columns in total, and 91 'level 1' columns, there must be 6 'level 0' columns. I cant see them cos the tables so big it just shows the first and last.
In reality, its a table of stock data pulled off yahoo where A, B, C are the (6) financial values like close, volume, high, etc and aa, bb, cc, etc are the (91) company codes.
Id like to learn how to do the following:
How to pull off a list of the 'level 0' column names.
How to pull off a list of the 'level 1' column names.
For 1 row (date), pull out the data for ALL 'level 0' and ONE 'level 1' index. (For example, all financial data for one company on one day).
For 1 row (date), ONE 'level 0' with ALL 'level 0' data. For example, volume data for all companies on one day.
Ive been trying things like:
df.loc[:,(['A','B'],['aa,'bb', 'cc'])]
df.loc['2023-01-02', :]
which work, but I cant sort the brackets and colons right to do the above stuff.
Also,
df.loc[:,(['A','D'],['aa,'cc','ff'])]
and
df.loc['2023-01-05':,([A,C],[aa,dd])]
work, but
df.loc['2023-01-05',([A:],[aa,dd])]
and
df.loc['2023-01-05',(A:,[aa,dd])]
give invalid syntax. Can anyone explain, or maybe point me towards a tutorial that will help with the level definitions and round/square brackets and colons?
Thanks.
To pull a list of level column names, you can use get_level_values
:
df.columns.get_level_values(0)
#Index(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'], dtype='object')
df.columns.get_level_values(1)
#Index(['aa', 'bb', 'cc', 'aa', 'bb', 'cc', 'aa', 'bb', 'cc', 'aa', 'bb', 'cc'], dtype='object')
df.columns.get_level_values(0).unique()
#Index(['A', 'B', 'C', 'D'], dtype='object')
df.columns.get_level_values(1).unique()
#Index(['aa', 'bb', 'cc'], dtype='object')
For 3 and 4, pd.IndexSlice
would be convenient to use:
# all level zero data for a specific level one index
df.loc['2023-01-05', pd.IndexSlice[:, 'aa']]
#A aa 3
#B aa 3
#C aa 4
#D aa 10
#Name: 2023-01-05, dtype: int64
# all level one data for a specific level zero index
df.loc['2023-01-05', pd.IndexSlice['A', :]]
#A aa 3
# bb 21
# cc 10
#Name: 2023-01-05, dtype: int64