Statsmodels - OLS Clustered Standard Errors (not accepting Series from DF?)


I am running an analysis that could benefit from clustering standard errors by BEA region. I have not used the clustered standard error option in Statsmodels before, so I am unsure whether I have the syntax wrong or the option is broken. Any help would be greatly appreciated.

Here is the relevant section of code (note that the topline_specs dict holds Patsy-style formulas):

import statsmodels.formula.api as smf

#Capture topline specs
topline_specs={'GO':spec_dict['PC_GO']['Total']['TYPE']['BOTH'],
               'RV':spec_dict['PC_RV']['Total']['TYPE']['BOTH'],
               'ISSUER':spec_dict['PROP']['ISSUER']['TYPE']['BOTH'],
               'PURPOSE':spec_dict['PROP']['PURPOSE']['TYPE']['BOTH']}

#Estimate each model
topline_mods={'GO':smf.ols(formula=topline_specs['GO'],data=data_d).fit(cov_type='cluster',
                                                                       cov_kwds={'groups':data_d['BEA_INT']})}

topline_mods['GO']

The traceback originates in a NumPy call and ends with:

ValueError: The weights and list don't have the same length.

Everything I could find on clustered standard errors suggested that the groups entry in cov_kwds can be a Series taken from the DataFrame that holds the model data. What am I missing?


Solution

  • When a model is created from a formula, missing-value handling defaults to 'drop': rows with missing observations are dropped from all data arrays passed to the model's __init__. In the non-formula interface, the default is currently to ignore missing values.

    However, arrays supplied at a later point, such as the groups passed in cov_kwds, are currently neither checked nor automatically trimmed of missing rows. If groups still has the original number of observations while rows have been dropped from the dependent and explanatory variables, the lengths no longer match and the reported exception is raised.

    I reopened https://github.com/statsmodels/statsmodels/issues/1220 because missing values can be handled in the special cases where the pandas indices give us enough information.
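Until that is handled automatically, one possible workaround is to drop the incomplete rows yourself before fitting, so that the model data and the groups array come from the same set of rows. The sketch below is only an illustration under stated assumptions: the DataFrame, the column names (y, x, region) and the formula are made up and stand in for data_d, BEA_INT and the Patsy formulas in the question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for data_d; "region" plays the role of BEA_INT.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "x": rng.normal(size=n),
    "region": rng.integers(0, 8, size=n),
})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(size=n)
df.loc[df.index[:10], "x"] = np.nan  # introduce some missing values

formula = "y ~ x"  # hypothetical formula

# The formula interface silently drops the rows with missing values,
# so the fitted model has fewer observations than the DataFrame has rows.
res_default = smf.ols(formula, data=df).fit()
print(len(df), res_default.nobs)

# Workaround: drop incomplete rows first, then take both the model data
# and the cluster labels from the same cleaned frame.
clean = df.dropna(subset=["y", "x", "region"])
res = smf.ols(formula, data=clean).fit(
    cov_type="cluster",
    cov_kwds={"groups": clean["region"].to_numpy()},
)
print(res.bse)
```

Comparing len(df) with res_default.nobs is also a quick way to check whether dropped rows are the cause of the length mismatch in your own data.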