I have a DataFrame
with 'Date'
and 'Id'
in the index and 'Portfolio'
in the columns. Values are weights of security within the portfolio. Within the dates level of the index, I'd like to take every 3rd date and forward fill the security weight to the date subsequent to the next "every third" date.
This is a generic DataFrame
producer. with df
asigned at the end.
import pandas as pd
import numpy as np
from string import uppercase
def generic_portfolio_df(start, end, freq, num_port, num_sec, seed=314):
np.random.seed(seed)
portfolios = pd.Index(['Portfolio {}'.format(i) for i in uppercase[:num_port]],
name='Portfolio')
securities = ['s{:02d}'.format(i) for i in range(num_sec)]
dates = pd.date_range(start, end, freq=freq)
return pd.DataFrame(np.random.rand(len(dates) * num_sec, num_port),
index=pd.MultiIndex.from_product([dates, securities],
names=['Date', 'Id']),
columns=portfolios
).groupby(level=0).apply(lambda x: x / x.sum())
df = generic_portfolio_df('2014-12-31', '2015-05-30', 'BM', 3, 5)
df
looks like this:
Portfolio Portfolio A Portfolio B Portfolio C
Date Id
2014-12-31 s00 0.326164 0.201597 0.085340
s01 0.278614 0.314448 0.266392
s02 0.258958 0.089224 0.293570
s03 0.092760 0.262511 0.084208
s04 0.043503 0.132221 0.270490
2015-01-30 s00 0.094124 0.041722 0.248013
s01 0.197860 0.346862 0.265287
s02 0.232504 0.261939 0.125719
s03 0.193050 0.286359 0.337316
s04 0.282462 0.063118 0.023664
2015-02-27 s00 0.266900 0.484163 0.074970
s01 0.239319 0.083138 0.123289
s02 0.067958 0.262626 0.262548
s03 0.181974 0.108668 0.301149
s04 0.243849 0.061405 0.238044
2015-03-31 s00 0.321438 0.149010 0.125168
s01 0.217779 0.067209 0.040285
s02 0.173066 0.293539 0.417372
s03 0.048929 0.415637 0.216490
s04 0.238788 0.074605 0.200685
2015-04-30 s00 0.089122 0.135514 0.234565
s01 0.048235 0.028141 0.327739
s02 0.026016 0.039664 0.073588
s03 0.413139 0.397875 0.323671
s04 0.423487 0.398807 0.040437
2015-05-29 s00 0.135831 0.071604 0.235099
s01 0.240086 0.242436 0.131698
s02 0.304451 0.380368 0.101653
s03 0.213468 0.035276 0.372894
s04 0.106164 0.270317 0.158656
Within the dates level of the index, I'd like to take every 3rd date and forward fill the security weight to the date subsequent to the next "every third" date.
I want it to look like:
Portfolio Portfolio A Portfolio B Portfolio C
Date Id
2014-12-31 s00 0.326164 0.201597 0.085340
s01 0.278614 0.314448 0.266392
s02 0.258958 0.089224 0.293570
s03 0.092760 0.262511 0.084208
s04 0.043503 0.132221 0.270490
2015-01-30 s00 0.326164 0.201597 0.085340
s01 0.278614 0.314448 0.266392
s02 0.258958 0.089224 0.293570
s03 0.092760 0.262511 0.084208
s04 0.043503 0.132221 0.270490
2015-02-27 s00 0.326164 0.201597 0.085340
s01 0.278614 0.314448 0.266392
s02 0.258958 0.089224 0.293570
s03 0.092760 0.262511 0.084208
s04 0.043503 0.132221 0.270490
2015-03-31 s00 0.321438 0.149010 0.125168
s01 0.217779 0.067209 0.040285
s02 0.173066 0.293539 0.417372
s03 0.048929 0.415637 0.216490
s04 0.238788 0.074605 0.200685
2015-04-30 s00 0.321438 0.149010 0.125168
s01 0.217779 0.067209 0.040285
s02 0.173066 0.293539 0.417372
s03 0.048929 0.415637 0.216490
s04 0.238788 0.074605 0.200685
2015-05-29 s00 0.321438 0.149010 0.125168
s01 0.217779 0.067209 0.040285
s02 0.173066 0.293539 0.417372
s03 0.048929 0.415637 0.216490
s04 0.238788 0.074605 0.200685
Though I'm still interested in other's answers. I chose Alexander's answer over my own for the following reason:
%%timeit
df = generic_portfolio_df('2014-12-31', '2015-05-30', 'BM', 3, 5)
df = df.unstack()
df.iloc[3:] = np.nan
df = df.ffill(limit=3).stack()
100 loops, best of 3: 11.6 ms per loop
%%timeit
df = generic_portfolio_df('2014-12-31', '2015-05-30', 'BM', 3, 5)
df0 = df.loc[pd.IndexSlice[::3, :], :]
diff = df.index.difference(df0.index)
df.ix[diff] = np.nan
df.groupby(level=1).ffill(limit=3)
100 loops, best of 3: 21 ms per loop
Apparently, using stack
and unstack
is more efficient.
# Create Boolean index of rows to delete (every third row is marked as False).
idx = len(df.unstack())
idx = [i % 3 > 0 for i in range(idx)]
>>> idx
[False, True, True, False, True, True]
# Unstack the dataframe so you just have a column of dates
df = df.unstack()
# Delete those in the `idx` index.
df.loc[idx, :] = np.nan
# Forward fill the retained dates, and then restack your dataframe.
df = df.ffill(limit=3).stack()
>>> df.tail()
Portfolio Portfolio A Portfolio B Portfolio C
Date Id
2015-05-29 s00 0.321438 0.149010 0.125168
s01 0.217779 0.067209 0.040285
s02 0.173066 0.293539 0.417372
s03 0.048929 0.415637 0.216490
s04 0.238788 0.074605 0.200685