Search code examples
pandasreshapemulti-index

Reshape data with double-string headers in pandas


I have the following dataframe:

                    AAPL,Price AAPL,Volume  ...  GOOG,Ivol  GOOG,Shares
  Date                                                                
0 2019-12-25              21.5        1879            20.0         2010                             
1 2019-12-26              22.1        1887            19.9         2000                              
2 2019-12-27              23.0        1888            19.9         2045       
3 2019-12-30              22.3        1887            NaN          2050                                
4 2019-12-31              22.4        1900            20.1         1998       

I will need to run regressions for every one of these stocks features, and I will have to attach a set of specific dummies. Therefore, my aim is to reshape the database such that I have a double index made of Date in first place and Stock name in second place, i.e.

Date        Stock  Price  Volume ... Ivol Shares
2019-12-25  AAPL    21.5    1879 ... 22.1   3121
...         ...    ...    ...    ... ...  ... 
2019-12-25  GOOG    45.8    NaN  ... 20.0   2000 
...
2019-12-25   VER     NaN    NaN  ...  NaN    NaN   
2019-12-26  AAPL   ...  
...                                           
2019-12-31   VER    42.4    1900 ... 50.1   1998 

My problem is that I do not know how to treat the column names as they are strings of the type 'Stock1,Feature1' and thus I do not know how to generate the approriate double-index. Can somebody help? Notice that the features are the same for every stock, i.e. the list of features does not change across stocks. Of course I am open to different type of solutions in terms of reshaping data


Solution

  • You can reshape this with pd.wide_to_long. To get the correct format I'll first modify the columns Index to make 'AAPL,Price' -> 'Price,AAPL'.

    df.columns = [','.join(x[::-1]) for x in df.columns.str.split(',')]
    stubs = df.columns.str.split(',').str[0].unique().tolist()
    #['Price', 'Volume', 'Ivol', 'Shares']    
    
    df = df.reset_index('Date')  # Need it out of index
    df = pd.wide_to_long(df, i='Date', j='Stock', stubnames=stubs, sep=',', suffix='.*')
    

    #df, based on the 4 columns provided:
    
                      Price  Volume  Ivol  Shares
    Date       Stock                             
    2019-12-25 AAPL    21.5  1879.0   NaN     NaN
               GOOG     NaN     NaN  20.0  2010.0
    2019-12-26 AAPL    22.1  1887.0   NaN     NaN
               GOOG     NaN     NaN  19.9  2000.0
    2019-12-27 AAPL    23.0  1888.0   NaN     NaN
               GOOG     NaN     NaN  19.9  2045.0
    2019-12-30 AAPL    22.3  1887.0   NaN     NaN
               GOOG     NaN     NaN   NaN  2050.0
    2019-12-31 AAPL    22.4  1900.0   NaN     NaN
               GOOG     NaN     NaN  20.1  1998.0
    

    Also perhaps more simply, you could create the columns MultiIndex and then stack

    df.columns = pd.MultiIndex.from_tuples([tuple(x) for x in df.columns.str.split(',')])
    
    df = df.stack(level=0) # Index level won't have a name.