statsmodels ols from formula with groupby pandas

I have a dataframe of the type:

       date         TICKER        x1       x2  ...       Z        Y  month    x3
0 1999-12-31    A UN Equity  52.1330  51.9645  ...  0.0052      NaN     12   NaN
1 1999-12-31   AA UN Equity  92.9415  92.8715  ...  0.0052      NaN     12   NaN
2 1999-12-31  ABC UN Equity   3.6843   3.6539  ...  0.0052      NaN     12   NaN
3 1999-12-31  ABF UN Equity  22.0625  21.9375  ...  0.0052      NaN     12   NaN
4 1999-12-31  ABM UN Equity  10.2188  10.1250  ...  0.0052      NaN     12   NaN

I would like to run an OLS regression with the formula 'Y ~ x1 + x2:x3' for each group defined by ['TICKER','year','month'] (year is a column that is not shown here), using statsmodels.formula.api imported as smf. I therefore use:

data.groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))

However, I get the following error:

IndexError: tuple index out of range

Any idea why?

The full traceback is:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 894, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 928, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\ops.py", line 238, in apply
    res = f(group)
  File "<input>", line 1, in <lambda>
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 195, in from_formula
    mod = cls(endog, exog, *args, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 872, in __init__
    super(OLS, self).__init__(endog, exog, missing=missing,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 703, in __init__
    super(WLS, self).__init__(endog, exog, missing=missing,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 190, in __init__
    super(RegressionModel, self).__init__(endog, exog, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 237, in __init__
    super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 77, in __init__
    self.data = self._handle_data(endog, exog, missing, hasconst,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 101, in _handle_data
    data = handle_data(endog, exog, missing, hasconst, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 672, in handle_data
    return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 71, in __init__
    arrays, nan_idx = self.handle_missing(endog, exog, missing,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 247, in handle_missing
    if combined_nans.shape[0] != nan_mask.shape[0]:
IndexError: tuple index out of range

Solution

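I see that your Y column has a lot of NaNs. The formula interface drops rows with missing values by default, so a subgroup whose Y values are all NaN is left with no data at all. You need to make sure every subgroup has enough complete observations for the regression to work.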
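To see which subgroups are affected, you can count the complete rows per group before fitting anything. This is only a minimal sketch on a small made-up frame (the values and the helper column name `complete` are my own, not from your data):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame: group A has no usable Y at all, group B has two complete rows.
data = pd.DataFrame({
    "TICKER": ["A", "A", "A", "B", "B", "B"],
    "year": [2000] * 6,
    "month": [1] * 6,
    "Y": [np.nan, np.nan, np.nan, 0.1, 0.2, np.nan],
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
    "x3": [1.5, 1.4, 1.3, 1.2, 1.1, 1.0],
})

model_cols = ["Y", "x1", "x2", "x3"]  # columns used by the formula
keys = ["TICKER", "year", "month"]

# Flag rows with no NaN in any model column, then count them per group.
n_complete = (
    data.assign(complete=data[model_cols].notna().all(axis=1))
        .groupby(keys)["complete"]
        .sum()
)
print(n_complete)
```

Any group with a count of 0 has nothing left to regress on, and a group with very few complete rows will not give a meaningful fit either.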

So if I use some example data:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(123)
data = pd.concat([
    pd.DataFrame({'TICKER':np.random.choice(['A','B','C'],30),
                    'year':np.random.choice([2000,2001],30),
                    'month':np.random.choice([1,2],30)}),
    pd.DataFrame(np.random.normal(0,1,(30,4)),columns=['Y','x1','x2','x3'])
],axis=1)

# set Y to NaN for the first 7 rows (.loc slicing is inclusive)
data.loc[:6,'Y'] = np.nan

If I run your code on the data frame above, I get the same error.

So if we keep only the rows that are complete in the columns used by your regression:

complete_ix = data[['Y','x1','x2','x3']].dropna().index
data.loc[complete_ix].groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))

It works:

TICKER  year  month
A       2000  2        <statsmodels.regression.linear_model.OLS objec...
        2001  1        <statsmodels.regression.linear_model.OLS objec...
              2        <statsmodels.regression.linear_model.OLS objec...
B       2000  1        <statsmodels.regression.linear_model.OLS objec...
              2        <statsmodels.regression.linear_model.OLS objec...
        2001  1        <statsmodels.regression.linear_model.OLS objec...
C       2000  1        <statsmodels.regression.linear_model.OLS objec...
              2        <statsmodels.regression.linear_model.OLS objec...
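Note that the result above holds unfitted OLS model objects, because smf.ols only builds the model. If what you want is the coefficients per group, one possible approach is to call .fit() inside the function and return .params. The sketch below uses synthetic data of my own, and the helper `fit_params` and the threshold `MIN_OBS` are assumptions for illustration, not something statsmodels provides:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data shaped like the example above (not the asker's data).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "TICKER": rng.choice(["A", "B"], n),
    "year": rng.choice([2000, 2001], n),
    "month": rng.choice([1, 2], n),
    "Y": rng.normal(size=n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
data.loc[:6, "Y"] = np.nan  # some missing responses, as in the question

FORMULA = "Y ~ x1 + x2:x3"
model_cols = ["Y", "x1", "x2", "x3"]
keys = ["TICKER", "year", "month"]

# Assumed threshold: the formula estimates 3 parameters (Intercept, x1, x2:x3),
# so a group needs more than 3 complete rows to have residual degrees of freedom.
MIN_OBS = 5


def fit_params(group):
    """Fit the OLS model on one group and return its coefficients as a Series."""
    return smf.ols(formula=FORMULA, data=group).fit().params


# Keep only complete rows, then only groups that are large enough.
complete = data.dropna(subset=model_cols)
big_enough = complete.groupby(keys).filter(lambda g: len(g) >= MIN_OBS)

# One row per group, one column per coefficient.
params = big_enough.groupby(keys)[model_cols].apply(fit_params)
print(params)
```

Returning the fitted results object instead of .params also works if you need standard errors or the summary; in that case the result is a Series of results objects indexed by group.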