Just approached the hierarchical columns in pandas. The original dataframe (df) has 27 columns and looks like the following (Ticker is the index):
Report Date Shares Gross Profit ...
Ticker
AAPL 2010-07-31 347000000.0 543000000.0 ...
AAPL 2010-10-31 344000000.0 548000000.0 ...
AAPL 2011-01-31 347000000.0 556000000.0 ...
AAPL 2011-04-30 347000000.0 580000000.0 ...
AAPL 2011-07-31 348000000.0 591000000.0 ...
I'd like to modify the column structure so that the first level is Report Date, the second level is the columns Shares and Gross Profit. I tried to create a new dataframe with this structure for just one ticker (AAPL), this is the code I used:
col = pd.MultiIndex.from_product([df['Report Date'], df[['Shares', 'Gross Profit']]])
df1 = pd.DataFrame(df.loc['AAPL'], columns=col)
It seems working apparently, but there are just NaN:
Report Date 2010-07-31 2010-10-31 \
Shares Gross Profit Shares Gross Profit
Ticker
AAPL NaN NaN NaN NaN
AAPL NaN NaN NaN NaN
AAPL NaN NaN NaN NaN
AAPL NaN NaN NaN NaN
Moreover, the shape exploded to (78, 112668). Can anybody spot the error? I guess it's in MultiIndex.from_product but cannot understand where.
This question can be solved by df.melt() if we aim at producing the transposed version of the desired output first. You can easily set the double-leveled MultiIndex
before df.transpose()
.
df_want = df.melt(id_vars="Report Date", value_vars=["Shares", "Gross Profit"])\
.sort_values(["Report Date", "variable"])\
.set_index(["Report Date", "variable"])\
.transpose()
print(df_want)
Report Date 2010-07-31 ... 2011-07-31
variable Gross Profit Shares ... Gross Profit Shares
value 543000000.0 347000000.0 ... 591000000.0 348000000.0
[1 rows x 10 columns]
N.B. The problem of the original attempt: IMO a better data wrangling strategy is to let the desired indexes/columns be produced naturally within the data processing pipeline or set via standard Pandas APIs, especially when the names or indexes/columns are already present in the source dataframe.
Edit: "Produce the desired index/columns naturally" means to NOT compute them outside the df.f1(...).f2(...).f3(...)...
pipeline and assign the externally generated index/columns to the output DataFrame. Such approach can produce less error-prone and more maintainable code in general.
In other words, manually generating indexes or column names is not likely a Pandastic way to go, except maybe for pre-allocation of empty dataframes.
I think there is a great chance that dealing with multiple tickers at once would be a realistic use case. So I also provide such a generalized version just in case. This solution is of course also applicable for single-ticker dataframes.
Data
Report Date Shares Gross Profit
Ticker
AAPL 2010-07-31 347000000.0 543000000.0
AAPL 2010-10-31 344000000.0 548000000.0
AAPL 2011-01-31 347000000.0 556000000.0
AAPL 2011-04-30 347000000.0 580000000.0
AAPL 2011-07-31 348000000.0 591000000.0
GOOG 2011-07-31 448000004.0 691000000.0
GOOG 2010-07-31 447000004.0 643000000.0
GOOG 2010-10-31 444000004.0 648000000.0
GOOG 2011-01-31 447000004.0 656000000.0
GOOG 2011-04-30 447000004.0 680000000.0
Code
df_want = df.reset_index()\
.melt(id_vars=["Ticker", "Report Date"], value_vars=["Shares", "Gross Profit"])\
.sort_values(["Ticker", "Report Date", "variable"])\
.pivot(index="Ticker", columns=["Report Date", "variable"], values="value")
Result
print(df_want)
Report Date 2010-07-31 ... 2011-07-31
variable Gross Profit Shares ... Gross Profit Shares
Ticker ...
AAPL 543000000.0 347000000.0 ... 591000000.0 348000000.0
GOOG 643000000.0 447000004.0 ... 691000000.0 448000004.0
[2 rows x 10 columns]
I am using pandas v1.1.3 and python 3.7 on a 64-bit debian 10 laptop.