Search code examples
pandasdataframemulti-index

Issues in creating hierarchical columns in pandas


Just approached the hierarchical columns in pandas. The original dataframe (df) has 27 columns and looks like the following (Ticker is the index):

        Report Date   Shares        Gross Profit      ...
Ticker                                                                          
AAPL    2010-07-31    347000000.0   543000000.0       ...     
AAPL    2010-10-31    344000000.0   548000000.0       ...
AAPL    2011-01-31    347000000.0   556000000.0       ...
AAPL    2011-04-30    347000000.0   580000000.0       ...
AAPL    2011-07-31    348000000.0   591000000.0       ...

I'd like to modify the column structure so that the first level is Report Date, the second level is the columns Shares and Gross Profit. I tried to create a new dataframe with this structure for just one ticker (AAPL), this is the code I used:

col = pd.MultiIndex.from_product([df['Report Date'], df[['Shares', 'Gross Profit']]])
df1 = pd.DataFrame(df.loc['AAPL'], columns=col)

It seems working apparently, but there are just NaN:

Report Date 2010-07-31           2010-10-31               \
            Shares  Gross Profit Shares Gross Profit   
Ticker                                                                
AAPL        NaN     NaN          NaN          NaN   
AAPL        NaN     NaN          NaN          NaN   
AAPL        NaN     NaN          NaN          NaN   
AAPL        NaN     NaN          NaN          NaN   

Moreover, the shape exploded to (78, 112668). Can anybody spot the error? I guess it's in MultiIndex.from_product but cannot understand where.


Solution

  • Solution

    This question can be solved by df.melt() if we aim at producing the transposed version of the desired output first. You can easily set the double-leveled MultiIndex before df.transpose().

    df_want = df.melt(id_vars="Report Date", value_vars=["Shares", "Gross Profit"])\
        .sort_values(["Report Date", "variable"])\
        .set_index(["Report Date", "variable"])\
        .transpose()
    

    Result

    print(df_want)
    
    Report Date   2010-07-31               ...   2011-07-31             
    variable    Gross Profit       Shares  ... Gross Profit       Shares
    value        543000000.0  347000000.0  ...  591000000.0  348000000.0
    
    [1 rows x 10 columns]
    

    N.B. The problem of the original attempt: IMO a better data wrangling strategy is to let the desired indexes/columns be produced naturally within the data processing pipeline or set via standard Pandas APIs, especially when the names or indexes/columns are already present in the source dataframe.

    Edit: "Produce the desired index/columns naturally" means to NOT compute them outside the df.f1(...).f2(...).f3(...)... pipeline and assign the externally generated index/columns to the output DataFrame. Such approach can produce less error-prone and more maintainable code in general.

    In other words, manually generating indexes or column names is not likely a Pandastic way to go, except maybe for pre-allocation of empty dataframes.

    Generalization to Multiple Tickers

    I think there is a great chance that dealing with multiple tickers at once would be a realistic use case. So I also provide such a generalized version just in case. This solution is of course also applicable for single-ticker dataframes.

    Data

            Report Date   Shares        Gross Profit
    Ticker                                                              
    AAPL    2010-07-31    347000000.0   543000000.0   
    AAPL    2010-10-31    344000000.0   548000000.0 
    AAPL    2011-01-31    347000000.0   556000000.0 
    AAPL    2011-04-30    347000000.0   580000000.0 
    AAPL    2011-07-31    348000000.0   591000000.0
    GOOG    2011-07-31    448000004.0   691000000.0 
    GOOG    2010-07-31    447000004.0   643000000.0 
    GOOG    2010-10-31    444000004.0   648000000.0 
    GOOG    2011-01-31    447000004.0   656000000.0 
    GOOG    2011-04-30    447000004.0   680000000.0 
    

    Code

    df_want = df.reset_index()\
        .melt(id_vars=["Ticker", "Report Date"], value_vars=["Shares", "Gross Profit"])\
        .sort_values(["Ticker", "Report Date", "variable"])\
        .pivot(index="Ticker", columns=["Report Date", "variable"], values="value")
    

    Result

    print(df_want)
    
    Report Date   2010-07-31               ...   2011-07-31             
    variable    Gross Profit       Shares  ... Gross Profit       Shares
    Ticker                                 ...                          
    AAPL         543000000.0  347000000.0  ...  591000000.0  348000000.0
    GOOG         643000000.0  447000004.0  ...  691000000.0  448000004.0
    [2 rows x 10 columns]
    

    I am using pandas v1.1.3 and python 3.7 on a 64-bit debian 10 laptop.