I have the following dataframe:
AAPL,Price AAPL,Volume ... GOOG,Ivol GOOG,Shares
Date
0 2019-12-25 21.5 1879 20.0 2010
1 2019-12-26 22.1 1887 19.9 2000
2 2019-12-27 23.0 1888 19.9 2045
3 2019-12-30 22.3 1887 NaN 2050
4 2019-12-31 22.4 1900 20.1 1998
I will need to run regressions for every one of these stocks features, and I will have to attach a set of specific dummies. Therefore, my aim is to reshape the database such that I have a double index made of Date in first place and Stock name in second place, i.e.
Date Stock Price Volume ... Ivol Shares
2019-12-25 AAPL 21.5 1879 ... 22.1 3121
... ... ... ... ... ... ...
2019-12-25 GOOG 45.8 NaN ... 20.0 2000
...
2019-12-25 VER NaN NaN ... NaN NaN
2019-12-26 AAPL ...
...
2019-12-31 VER 42.4 1900 ... 50.1 1998
My problem is that I do not know how to treat the column names as they are strings of the type 'Stock1,Feature1' and thus I do not know how to generate the approriate double-index. Can somebody help? Notice that the features are the same for every stock, i.e. the list of features does not change across stocks. Of course I am open to different type of solutions in terms of reshaping data
You can reshape this with pd.wide_to_long
. To get the correct format I'll first modify the columns Index to make 'AAPL,Price'
-> 'Price,AAPL'
.
df.columns = [','.join(x[::-1]) for x in df.columns.str.split(',')]
stubs = df.columns.str.split(',').str[0].unique().tolist()
#['Price', 'Volume', 'Ivol', 'Shares']
df = df.reset_index('Date') # Need it out of index
df = pd.wide_to_long(df, i='Date', j='Stock', stubnames=stubs, sep=',', suffix='.*')
#df, based on the 4 columns provided:
Price Volume Ivol Shares
Date Stock
2019-12-25 AAPL 21.5 1879.0 NaN NaN
GOOG NaN NaN 20.0 2010.0
2019-12-26 AAPL 22.1 1887.0 NaN NaN
GOOG NaN NaN 19.9 2000.0
2019-12-27 AAPL 23.0 1888.0 NaN NaN
GOOG NaN NaN 19.9 2045.0
2019-12-30 AAPL 22.3 1887.0 NaN NaN
GOOG NaN NaN NaN 2050.0
2019-12-31 AAPL 22.4 1900.0 NaN NaN
GOOG NaN NaN 20.1 1998.0
Also perhaps more simply, you could create the columns MultiIndex and then stack
df.columns = pd.MultiIndex.from_tuples([tuple(x) for x in df.columns.str.split(',')])
df = df.stack(level=0) # Index level won't have a name.