statsmodels ols from formula with groupby pandas


I have a dataframe of the type:

       date         TICKER        x1       x2  ...       Z        Y  month    x3
0 1999-12-31    A UN Equity  52.1330  51.9645  ...  0.0052      NaN     12   NaN
1 1999-12-31   AA UN Equity  92.9415  92.8715  ...  0.0052      NaN     12   NaN
2 1999-12-31  ABC UN Equity   3.6843   3.6539  ...  0.0052      NaN     12   NaN
3 1999-12-31  ABF UN Equity  22.0625  21.9375  ...  0.0052      NaN     12   NaN
4 1999-12-31  ABM UN Equity  10.2188  10.1250  ...  0.0052      NaN     12   NaN

I would like to run an OLS regression with the formula 'Y ~ x1 + x2:x3' for each group defined by ['TICKER','year','month'] (year is a column that is not shown here), using statsmodels.formula.api imported as smf. I therefore use:

data.groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))

However, I get the following error:

IndexError: tuple index out of range

Any idea why?

The full traceback is:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 894, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 928, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\ops.py", line 238, in apply
    res = f(group)
  File "<input>", line 1, in <lambda>
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 195, in from_formula
    mod = cls(endog, exog, *args, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 872, in __init__
    super(OLS, self).__init__(endog, exog, missing=missing,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 703, in __init__
    super(WLS, self).__init__(endog, exog, missing=missing,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 190, in __init__
    super(RegressionModel, self).__init__(endog, exog, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 237, in __init__
    super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 77, in __init__
    self.data = self._handle_data(endog, exog, missing, hasconst,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 101, in _handle_data
    data = handle_data(endog, exog, missing, hasconst, **kwargs)
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 672, in handle_data
    return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 71, in __init__
    arrays, nan_idx = self.handle_missing(endog, exog, missing,
  File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 247, in handle_missing
    if combined_nans.shape[0] != nan_mask.shape[0]:
IndexError: tuple index out of range

Solution

  • I see that your Y column has a lot of NaNs, so you need to ensure that each subgroup has enough complete observations for the regression to work.

    Take this example data:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    np.random.seed(123)
    data = pd.concat([
        pd.DataFrame({'TICKER':np.random.choice(['A','B','C'],30),
                        'year':np.random.choice([2000,2001],30),
                        'month':np.random.choice([1,2],30)}),
        pd.DataFrame(np.random.normal(0,1,(30,4)),columns=['Y','x1','x2','x3'])
    ],axis=1)
    
    data.loc[:6,'Y'] = np.nan
    

    If I run your code on the data frame above, I get the same error.

    If we instead keep only the rows that are complete in the columns the regression uses:

    complete_ix = data[['Y','x1','x2','x3']].dropna().index
    data.loc[complete_ix].groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))
    

    It works:

    TICKER  year  month
    A       2000  2        <statsmodels.regression.linear_model.OLS objec...
            2001  1        <statsmodels.regression.linear_model.OLS objec...
                  2        <statsmodels.regression.linear_model.OLS objec...
    B       2000  1        <statsmodels.regression.linear_model.OLS objec...
                  2        <statsmodels.regression.linear_model.OLS objec...
            2001  1        <statsmodels.regression.linear_model.OLS objec...
    C       2000  1        <statsmodels.regression.linear_model.OLS objec...
                  2        <statsmodels.regression.linear_model.OLS objec...
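Note that the call above returns unfitted OLS model objects, because .fit() is never called. As a minimal sketch (not part of the original answer), the same idea can be extended to fit each group and collect the coefficients, skipping groups that have too few complete rows. The helper fit_group, the constant MIN_OBS and the synthetic data are assumptions made for illustration; the threshold of 4 is simply the three coefficients (Intercept, x1, x2:x3) plus one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

FORMULA = 'Y ~ x1 + x2:x3'
COLS = ['Y', 'x1', 'x2', 'x3']      # columns the formula uses
KEYS = ['TICKER', 'year', 'month']  # grouping columns
MIN_OBS = 4  # assumed threshold: 3 coefficients (Intercept, x1, x2:x3) plus one


def fit_group(group):
    """Return fitted coefficients for one group, or None if it has too few complete rows."""
    complete = group.dropna(subset=COLS)
    if len(complete) < MIN_OBS:
        return None
    return smf.ols(formula=FORMULA, data=complete).fit().params


# Synthetic data (hypothetical), similar in shape to the example above
rng = np.random.default_rng(123)
n = 200
data = pd.DataFrame({
    'TICKER': rng.choice(['A', 'B'], n),
    'year': rng.choice([2000, 2001], n),
    'month': rng.choice([1, 2], n),
})
for col in COLS:
    data[col] = rng.normal(size=n)

# Make one group unusable: every Y in it is missing
mask = (data['TICKER'] == 'A') & (data['year'] == 2000) & (data['month'] == 1)
data.loc[mask, 'Y'] = np.nan

# Fit group by group and keep only the groups that could be estimated
rows = []
for (ticker, year, month), group in data.groupby(KEYS):
    params = fit_group(group)
    if params is None:
        continue  # not enough complete observations in this group
    rows.append({'TICKER': ticker, 'year': year, 'month': month, **params.to_dict()})

coefs = pd.DataFrame(rows).set_index(KEYS)
print(coefs)
```

The result is one row of coefficients per group that could be estimated. The group whose Y values are all missing is left out instead of raising an error.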