Search code examples
pythonpandasregressionfinance

pandas dataframe daily regression at monthly frequency


I have a daily return data of a number of stocks that looks like:

           Stock A  Stock B  Stock C  Market
date
1987-02-02  0.01    0.02      0.02     0.01
1987-02-03  0.02    0.03      0.02     0.02
1987-02-04  0.03    0.01      0.01     0.03
1987-02-05  0.04    0.03      0.05     0.04

I want to calculate 30 day regression of Stock A, B, C on market, but only at the end of each month, i.e. at 1987-02-28, 1987-03-31... And then save the regression results in two matrices (one for the constant term, and one for the coefficient):

           Stock A  Stock B  Stock C  
date
1987-02-28  const    const     const   
1987-03-31  const    const     const  
1987-04-30  const    const     const 

           Stock A  Stock B  Stock C 
date
1987-02-28  coeff    coeff     coeff 
1987-03-31  coeff    coeff     coeff 
1987-04-30  coeff    coeff     coeff 

So far what I did was to create an indicator for "end of month" and then loop over all rows and columns of the date:

loop over columns:
    loop over rows:
        if end of month is true:
            regress Stock on Market using past 30 days data
            save the result 

Given I have a lot of stocks (7000+) over 50 years, this way is very slow. I am wondering if anyone has worked on similar problems before and has faster way of implementing this? Any tip on how to improve the speed or efficiency would be greatly appreciated.


Solution

  • You could start with an approach like this instead

    import pandas as pd
    import numpy as np
    from datetime import datetime
    from pandas.tseries.offsets import MonthEnd
    
    #sample Data
    df = pd.DataFrame(index=pd.DatetimeIndex(freq='D',start=datetime(1990,1,1),end=datetime(1995,12,25)),data=np.random.randint(0,100,(2185,3)),columns=['Stock A','Stock B','Stock C'])
    
    #Create a column that has the end of the month for each date
    df['end'] = df.index + MonthEnd(1)
    
    #Groupby the end of the month and apply your regression function
    for group,data in df.groupby('end'):
        for row in data.columns #Sudo code loop over columns used per regression
            regressFunction()
    

    This should eliminate wasted time looping over rows and columns and instead, just compute the regression on known indices. Also, it maybe beneficial to compute the regression in parallel to speed up computation.