I have a multiindex dataframe, df1, pulled from yahoo using yfinance, with a timeseries index and two-level multiindex columns. Level 0 is 'adj close', 'high' and 'volume', and level 1 is a list of company symbols, like this:
Adj Close | High | Volume | |||||||
---|---|---|---|---|---|---|---|---|---|
C1 | C2 | C3 | C1 | C2 | C3 | C1 | C2 | C3 | |
Date | |||||||||
02-01-2020 | 12.78 | 41.73 | 24.03 | 13.50 | 41.77 | 26.43 | 100000 | 1234300 | 23454 |
03-01-2020 | 12.29 | 41.11 | 23.61 | 12.77 | 42.09 | 23.99 | 100022 | 1555555 | 23999 |
Im trying to make a new dataframe, df2, by making calculations on the columns of df1.
For example, using the same timeseries index, Id like to make the first column in the new dataframe, df2, the percent increase for each of the (level 1) symbols in the (level 0) price column of df1. Then columns in df2 being rolling moving average of the Price (level 0) each symbol (level 1). Also a column that writes a boolean True if the High of today is higher than the Adj Close of yesterday.
Basically, each column in df2 will be made by manipulating level 0 data from the corresponding level 1 company symbols in df1.
(Im saying column I now know its not a column, but a group of columns grouped by the level 0)
So I wrote a function:
def indicators_df(df1):
# Create a new DataFrame with the same index as df1
df2 = pd.DataFrame(index=df1.index)
# Plus / minus change %
df2['Pct'] = df1['Adj Close'].pct_change().fillna(0)
# Price MAs
df2['$10MA'] = df1['Adj Close'].rolling(window=10).mean()
df2['$20MA'] = df1['Adj Close'].rolling(window=20).mean()
df2['$50MA'] = df1['Adj Close'].rolling(window=50).mean()
df2['$100MA'] = df1['Adj Close'].rolling(window=100).mean()
df2['$200MA'] = df1['Adj Close'].rolling(window=200).mean()
# Volume MAs
df2['V10MA'] = df1['Volume'].rolling(window=10).mean()
df2['V20MA'] = df1['Volume'].rolling(window=20).mean()
df2['V50MA'] = df1['Volume'].rolling(window=50).mean()
df2['V100MA'] = df1['Volume'].rolling(window=100).mean()
df2['V200MA'] = df1['Volume'].rolling(window=200).mean()
return df2
which returns this error:
ValueError: Cannot set a DataFrame with multiple columns to the single column Pct
I tried loads of variations, but kept getting the same error until I realised (I think...) Im trying to put the results for all the level 1 company symbols in one column in the new dataframe, df2, which is not multiindex.
So how would I write the function correctly, creating a new multiindex dataframe where the level 0 values/keys (not sure of the correct terminology) are the calculations Ive made on df1, such as percent change, rolling average etc?
Thanks.
IIUC, as a simple approach, you can concat
multiple xs
on a predefined pair of computations :
def indicators_df(df):
def xs(df, m):
return df.xs(m, axis=1, drop_level=False)
def rn(df, d):
return df.rename(d, axis=1, level=0)
pct = (df.pipe(xs, "Adj Close").pct_change()
.fillna(0).pipe(rn, {"Adj Close": "Pct"}))
tmp = df.pipe(xs, "High")
chk = tmp.eq(tmp.shift()).pipe(rn, {"High": "Check"})
_map = {k: [(w, f"{pre}{w}MA") for w in [10, 20, 50, 100, 200]]
for k, pre in [("Adj Close", "$"), ("Volume", "V")]}
pvs = [df.pipe(xs, l0).rolling(b).mean().pipe(rn, {l0:l1})
for (l0, p) in _map.items() for (b,l1) in p]
return pd.concat([pct, chk, *pvs], axis=1)
Output :
print(indicators_df(df))
Pct Check ... V100MA V200MA
C1 C2 C3 C1 C2 ... C2 C3 C1 C2 C3
Date ...
02-01-2020 0.00 0.00 0.00 False False ... NaN NaN NaN NaN NaN
03-01-2020 -0.04 -0.01 -0.02 False False ... NaN NaN NaN NaN NaN
[2 rows x 36 columns]
Used input :
df = pd.DataFrame.from_dict(
{'index': ['02-01-2020', '03-01-2020'],
'columns': [
('Adj Close', 'C1'), ('Adj Close', 'C2'), ('Adj Close', 'C3'),
('High', 'C1'), ('High', 'C2'), ('High', 'C3'), ('Volume', 'C1'),
('Volume', 'C2'),('Volume', 'C3')],
'data': [[12.78, 41.73, 24.03, 13.5, 41.77,
26.43, 100000, 1234300, 23454],
[12.29, 41.11, 23.61, 12.77, 42.09,
23.99, 100022, 1555555, 23999]],
'index_names': ['Date'],
'column_names': [None, None]}, orient='tight')