Search code examples
pythonpandassortingmulti-index

How do you sort data with multiindex (columns) dataframe?


First, please forgive my bad vocabulary. Im still struggling with the correct terms, and have just discovered that I have created a multiindexed dataframe, which Im trying to learn how to manipulate.

The multiindex dataframe has 30 rows and 546 columns, and looks like a bigger version of this:

A B C D
aa bb cc aa bb cc aa bb cc aa bb cc
Date
2023-01-02 1 24 6 3 2 7 3 10 12 5 9 21
2023-01-03 1 23 7 3 4 6 3 9 13 6 10 22
2023-01-04 2 22 8 4 6 7 3 9 12 8 14 24
2023-01-05 3 21 10 3 8 6 4 8 11 10 12 21

The index is a timestamp date, and the top level (level 0?) column indexes A, B, C, D, etc each have the same 91 second level (level 1?) members: aa, bb, cc, etc

Since there are 546 columns in total, and 91 'level 1' columns, there must be 6 'level 0' columns. I cant see them cos the tables so big it just shows the first and last.

In reality, its a table of stock data pulled off yahoo where A, B, C are the (6) financial values like close, volume, high, etc and aa, bb, cc, etc are the (91) company codes.

Id like to learn how to do the following:

  1. How to pull off a list of the 'level 0' column names.

  2. How to pull off a list of the 'level 1' column names.

  3. For 1 row (date), pull out the data for ALL 'level 0' and ONE 'level 1' index. (For example, all financial data for one company on one day).

  4. For 1 row (date), ONE 'level 0' with ALL 'level 0' data. For example, volume data for all companies on one day.

Ive been trying things like:

df.loc[:,(['A','B'],['aa,'bb', 'cc'])]
df.loc['2023-01-02', :]

which work, but I cant sort the brackets and colons right to do the above stuff.

Also,

 df.loc[:,(['A','D'],['aa,'cc','ff'])]

and

df.loc['2023-01-05':,([A,C],[aa,dd])]

work, but

df.loc['2023-01-05',([A:],[aa,dd])]

and

df.loc['2023-01-05',(A:,[aa,dd])]

give invalid syntax. Can anyone explain, or maybe point me towards a tutorial that will help with the level definitions and round/square brackets and colons?

Thanks.


Solution

  • To pull a list of level column names, you can use get_level_values:

    df.columns.get_level_values(0)
    #Index(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'], dtype='object')
    
    df.columns.get_level_values(1)
    #Index(['aa', 'bb', 'cc', 'aa', 'bb', 'cc', 'aa', 'bb', 'cc', 'aa', 'bb', 'cc'], dtype='object')
    
    df.columns.get_level_values(0).unique()
    #Index(['A', 'B', 'C', 'D'], dtype='object')
    
    df.columns.get_level_values(1).unique()
    #Index(['aa', 'bb', 'cc'], dtype='object')
    

    For 3 and 4, pd.IndexSlice would be convenient to use:

    # all level zero data for a specific level one index
    df.loc['2023-01-05', pd.IndexSlice[:, 'aa']]
    
    #A  aa     3
    #B  aa     3
    #C  aa     4
    #D  aa    10
    #Name: 2023-01-05, dtype: int64
    
    # all level one data for a specific level zero index
    
    df.loc['2023-01-05', pd.IndexSlice['A', :]]
    #A  aa     3
    #   bb    21
    #   cc    10
    #Name: 2023-01-05, dtype: int64