I have a dataframe of the type:
date TICKER x1 x2 ... Z Y month x3
0 1999-12-31 A UN Equity 52.1330 51.9645 ... 0.0052 NaN 12 NaN
1 1999-12-31 AA UN Equity 92.9415 92.8715 ... 0.0052 NaN 12 NaN
2 1999-12-31 ABC UN Equity 3.6843 3.6539 ... 0.0052 NaN 12 NaN
3 1999-12-31 ABF UN Equity 22.0625 21.9375 ... 0.0052 NaN 12 NaN
4 1999-12-31 ABM UN Equity 10.2188 10.1250 ... 0.0052 NaN 12 NaN
I would like to run an OLS regression with the formula 'Y ~ x1 + x2:x3' for each group defined by ['TICKER','year','month'] (year is a column that is not shown here), using statsmodels.formula.api imported as smf. I therefore use:
data.groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))
However, I get the following error:
IndexError: tuple index out of range
Any idea why?
The full traceback is:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 894, in apply
result = self._python_apply_general(f, self._selected_obj)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 928, in _python_apply_general
keys, values, mutated = self.grouper.apply(f, data, self.axis)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\ops.py", line 238, in apply
res = f(group)
File "<input>", line 1, in <lambda>
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 195, in from_formula
mod = cls(endog, exog, *args, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 872, in __init__
super(OLS, self).__init__(endog, exog, missing=missing,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 703, in __init__
super(WLS, self).__init__(endog, exog, missing=missing,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 190, in __init__
super(RegressionModel, self).__init__(endog, exog, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 237, in __init__
super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 77, in __init__
self.data = self._handle_data(endog, exog, missing, hasconst,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 101, in _handle_data
data = handle_data(endog, exog, missing, hasconst, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 672, in handle_data
return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 71, in __init__
arrays, nan_idx = self.handle_missing(endog, exog, missing,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 247, in handle_missing
if combined_nans.shape[0] != nan_mask.shape[0]:
IndexError: tuple index out of range
I see that your Y column has a lot of NaNs, so you need to make sure that each subgroup has enough complete observations for the regression to work.
Using some example data:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
np.random.seed(123)
data = pd.concat([
pd.DataFrame({'TICKER':np.random.choice(['A','B','C'],30),
'year':np.random.choice([2000,2001],30),
'month':np.random.choice([1,2],30)}),
pd.DataFrame(np.random.normal(0,1,(30,4)),columns=['Y','x1','x2','x3'])
],axis=1)
data.loc[:6,'Y'] = np.nan
If I run your code on the data frame above, I get the same error.
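To see which groups are likely to cause trouble, one option is to count the complete rows per group before fitting. The sketch below uses a tiny made-up frame (the values are hypothetical, not your data) in which one group has no usable Y at all:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every Y in group ('A', 2000, 1) is missing.
data = pd.DataFrame({
    "TICKER": ["A", "A", "A", "B", "B", "B"],
    "year": [2000] * 6,
    "month": [1] * 6,
    "Y": [np.nan, np.nan, np.nan, 0.3, -1.2, 0.8],
    "x1": [0.1, 0.4, -0.2, 1.0, 0.5, -0.7],
    "x2": [1.1, -0.3, 0.6, 0.2, -0.9, 0.4],
    "x3": [0.5, 0.7, -1.0, np.nan, 0.3, 0.9],
})

# A row is "complete" if every column used by the formula is present.
cols = ["Y", "x1", "x2", "x3"]
complete = data[cols].notna().all(axis=1)

# Number of complete rows in each (TICKER, year, month) group.
n_complete = complete.groupby(
    [data["TICKER"], data["year"], data["month"]]
).sum()
print(n_complete)
```

Any group with a count of zero has nothing left to regress on once the missing rows are dropped, and very small counts will not identify the three parameters of the formula either.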
So if we use only complete data (relevant for your regression):
complete_ix = data[['Y','x1','x2','x3']].dropna().index
data.loc[complete_ix].groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))
It works:
TICKER year month
A 2000 2 <statsmodels.regression.linear_model.OLS objec...
2001 1 <statsmodels.regression.linear_model.OLS objec...
2 <statsmodels.regression.linear_model.OLS objec...
B 2000 1 <statsmodels.regression.linear_model.OLS objec...
2 <statsmodels.regression.linear_model.OLS objec...
2001 1 <statsmodels.regression.linear_model.OLS objec...
C 2000 1 <statsmodels.regression.linear_model.OLS objec...
2 <statsmodels.regression.linear_model.OLS objec...
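Note that the result above holds unfitted OLS model objects, because smf.ols only builds the model; you still have to call .fit() to get estimates. A minimal sketch of one way to fit each group and collect the coefficients, assuming you want to skip groups with too few complete rows (the MIN_OBS threshold and the fit_group helper are my own illustrative choices, and the data is randomly generated):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical example data, larger than above so every group has enough rows.
rng = np.random.default_rng(123)
n = 200
data = pd.DataFrame({
    "TICKER": rng.choice(["A", "B"], n),
    "year": rng.choice([2000, 2001], n),
    "month": rng.choice([1, 2], n),
    "Y": rng.normal(size=n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
data.loc[:6, "Y"] = np.nan

MIN_OBS = 4  # assumption: at least one more row than the 3 parameters


def fit_group(g):
    """Fit the regression on one group, or return None if it is too small."""
    g = g.dropna(subset=["Y", "x1", "x2", "x3"])
    if len(g) < MIN_OBS:
        return None
    return smf.ols("Y ~ x1 + x2:x3", data=g).fit().params


# Collect one row of coefficients per group that could be fitted.
records = []
for (ticker, year, month), g in data.groupby(["TICKER", "year", "month"]):
    params = fit_group(g)
    if params is not None:
        records.append(
            {"TICKER": ticker, "year": year, "month": month, **params.to_dict()}
        )

coefs = pd.DataFrame(records)
print(coefs)
```

This gives one row per group with the columns Intercept, x1 and x2:x3, which is usually easier to work with than a Series of model objects.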