I have a daily return data of a number of stocks that looks like:
Stock A Stock B Stock C Market
date
1987-02-02 0.01 0.02 0.02 0.01
1987-02-03 0.02 0.03 0.02 0.02
1987-02-04 0.03 0.01 0.01 0.03
1987-02-05 0.04 0.03 0.05 0.04
I want to calculate 30 day regression of Stock A, B, C on market, but only at the end of each month, i.e. at 1987-02-28, 1987-03-31... And then save the regression results in two matrices (one for the constant term, and one for the coefficient):
Stock A Stock B Stock C
date
1987-02-28 const const const
1987-03-31 const const const
1987-04-30 const const const
Stock A Stock B Stock C
date
1987-02-28 coeff coeff coeff
1987-03-31 coeff coeff coeff
1987-04-30 coeff coeff coeff
So far what I did was to create an indicator for "end of month" and then loop over all rows and columns of the date:
loop over columns:
loop over rows:
if end of month is true:
regress Stock on Market using past 30 days data
save the result
Given I have a lot of stocks (7000+) over 50 years, this way is very slow. I am wondering if anyone has worked on similar problems before and has faster way of implementing this? Any tip on how to improve the speed or efficiency would be greatly appreciated.
You could start with an approach like this instead
import pandas as pd
import numpy as np
from datetime import datetime
from pandas.tseries.offsets import MonthEnd
#sample Data
df = pd.DataFrame(index=pd.DatetimeIndex(freq='D',start=datetime(1990,1,1),end=datetime(1995,12,25)),data=np.random.randint(0,100,(2185,3)),columns=['Stock A','Stock B','Stock C'])
#Create a column that has the end of the month for each date
df['end'] = df.index + MonthEnd(1)
#Groupby the end of the month and apply your regression function
for group,data in df.groupby('end'):
for row in data.columns #Sudo code loop over columns used per regression
regressFunction()
This should eliminate wasted time looping over rows and columns and instead, just compute the regression on known indices. Also, it maybe beneficial to compute the regression in parallel to speed up computation.