Search code examples
pythonnumpypandasquantitative-finance

DataFrame take every 3rd row and forward fill


I have a DataFrame with 'Date' and 'Id' in the index and 'Portfolio' in the columns. Values are weights of security within the portfolio. Within the dates level of the index, I'd like to take every 3rd date and forward fill the security weight to the date subsequent to the next "every third" date.

Setup

This is a generic DataFrame producer. with df asigned at the end.

import pandas as pd
import numpy as np
from string import uppercase

def generic_portfolio_df(start, end, freq, num_port, num_sec, seed=314):
    np.random.seed(seed)
    portfolios = pd.Index(['Portfolio {}'.format(i) for i in uppercase[:num_port]],
                          name='Portfolio')
    securities = ['s{:02d}'.format(i) for i in range(num_sec)]
    dates = pd.date_range(start, end, freq=freq)
    return pd.DataFrame(np.random.rand(len(dates) * num_sec, num_port),
                        index=pd.MultiIndex.from_product([dates, securities],
                                                         names=['Date', 'Id']),
                        columns=portfolios
                       ).groupby(level=0).apply(lambda x: x / x.sum())    

df = generic_portfolio_df('2014-12-31', '2015-05-30', 'BM', 3, 5)

df looks like this:

Portfolio       Portfolio A  Portfolio B  Portfolio C
Date       Id                                        
2014-12-31 s00     0.326164     0.201597     0.085340
           s01     0.278614     0.314448     0.266392
           s02     0.258958     0.089224     0.293570
           s03     0.092760     0.262511     0.084208
           s04     0.043503     0.132221     0.270490
2015-01-30 s00     0.094124     0.041722     0.248013
           s01     0.197860     0.346862     0.265287
           s02     0.232504     0.261939     0.125719
           s03     0.193050     0.286359     0.337316
           s04     0.282462     0.063118     0.023664
2015-02-27 s00     0.266900     0.484163     0.074970
           s01     0.239319     0.083138     0.123289
           s02     0.067958     0.262626     0.262548
           s03     0.181974     0.108668     0.301149
           s04     0.243849     0.061405     0.238044
2015-03-31 s00     0.321438     0.149010     0.125168
           s01     0.217779     0.067209     0.040285
           s02     0.173066     0.293539     0.417372
           s03     0.048929     0.415637     0.216490
           s04     0.238788     0.074605     0.200685
2015-04-30 s00     0.089122     0.135514     0.234565
           s01     0.048235     0.028141     0.327739
           s02     0.026016     0.039664     0.073588
           s03     0.413139     0.397875     0.323671
           s04     0.423487     0.398807     0.040437
2015-05-29 s00     0.135831     0.071604     0.235099
           s01     0.240086     0.242436     0.131698
           s02     0.304451     0.380368     0.101653
           s03     0.213468     0.035276     0.372894
           s04     0.106164     0.270317     0.158656

Question

Within the dates level of the index, I'd like to take every 3rd date and forward fill the security weight to the date subsequent to the next "every third" date.

I want it to look like:

Portfolio       Portfolio A  Portfolio B  Portfolio C
Date       Id                                        
2014-12-31 s00     0.326164     0.201597     0.085340
           s01     0.278614     0.314448     0.266392
           s02     0.258958     0.089224     0.293570
           s03     0.092760     0.262511     0.084208
           s04     0.043503     0.132221     0.270490
2015-01-30 s00     0.326164     0.201597     0.085340
           s01     0.278614     0.314448     0.266392
           s02     0.258958     0.089224     0.293570
           s03     0.092760     0.262511     0.084208
           s04     0.043503     0.132221     0.270490
2015-02-27 s00     0.326164     0.201597     0.085340
           s01     0.278614     0.314448     0.266392
           s02     0.258958     0.089224     0.293570
           s03     0.092760     0.262511     0.084208
           s04     0.043503     0.132221     0.270490
2015-03-31 s00     0.321438     0.149010     0.125168
           s01     0.217779     0.067209     0.040285
           s02     0.173066     0.293539     0.417372
           s03     0.048929     0.415637     0.216490
           s04     0.238788     0.074605     0.200685
2015-04-30 s00     0.321438     0.149010     0.125168
           s01     0.217779     0.067209     0.040285
           s02     0.173066     0.293539     0.417372
           s03     0.048929     0.415637     0.216490
           s04     0.238788     0.074605     0.200685
2015-05-29 s00     0.321438     0.149010     0.125168
           s01     0.217779     0.067209     0.040285
           s02     0.173066     0.293539     0.417372
           s03     0.048929     0.415637     0.216490
           s04     0.238788     0.074605     0.200685

Conclusion

Though I'm still interested in other's answers. I chose Alexander's answer over my own for the following reason:

%%timeit
    df = generic_portfolio_df('2014-12-31', '2015-05-30', 'BM', 3, 5)
    df = df.unstack()
    df.iloc[3:] = np.nan
    df = df.ffill(limit=3).stack()

100 loops, best of 3: 11.6 ms per loop

%%timeit
    df = generic_portfolio_df('2014-12-31', '2015-05-30', 'BM', 3, 5)
    df0 = df.loc[pd.IndexSlice[::3, :], :]
    diff = df.index.difference(df0.index)
    df.ix[diff] = np.nan
    df.groupby(level=1).ffill(limit=3)

100 loops, best of 3: 21 ms per loop

Apparently, using stack and unstack is more efficient.


Solution

  • # Create Boolean index of rows to delete (every third row is marked as False).
    idx = len(df.unstack())
    idx = [i % 3 > 0 for i in range(idx)]
    >>> idx
    [False, True, True, False, True, True]
    
    # Unstack the dataframe so you just have a column of dates 
    df = df.unstack()
    
    # Delete those in the `idx` index.
    df.loc[idx, :] = np.nan
    
    # Forward fill the retained dates, and then restack your dataframe.
    df = df.ffill(limit=3).stack()
    
    >>> df.tail()
    Portfolio       Portfolio A  Portfolio B  Portfolio C
    Date       Id                                        
    2015-05-29 s00     0.321438     0.149010     0.125168
               s01     0.217779     0.067209     0.040285
               s02     0.173066     0.293539     0.417372
               s03     0.048929     0.415637     0.216490
               s04     0.238788     0.074605     0.200685