I am running an analysis that could benefit from clustering by BEA regions. I have not used the clustered standard error option in Statsmodels before, so I am unsure whether I am messing up the syntax or the option is broken. Any help would be greatly appreciated.
Here is the relevant section of code (note that the topline_specs dict returns Patsy-style formulas):
# Capture topline specs
topline_specs = {'GO': spec_dict['PC_GO']['Total']['TYPE']['BOTH'],
                 'RV': spec_dict['PC_RV']['Total']['TYPE']['BOTH'],
                 'ISSUER': spec_dict['PROP']['ISSUER']['TYPE']['BOTH'],
                 'PURPOSE': spec_dict['PROP']['PURPOSE']['TYPE']['BOTH']}
# Estimate each model
topline_mods = {'GO': smf.ols(formula=topline_specs['GO'], data=data_d).fit(
    cov_type='cluster',
    cov_kwds={'groups': data_d['BEA_INT']})}
topline_mods['GO']
The traceback stems from a numpy call. It returns the following:
ValueError: The weights and list don't have the same length.
Everything I could find on clustered standard errors suggested that the cov_kwds argument can take a Series from the DataFrame housing the model data. What am I missing?
When a model is created with formulas, the missing value handling defaults to 'drop', and rows with missing observations are dropped from all data arrays given to the model (in __init__). In the non-formula interface the default is currently to ignore missing values.
However, there is currently no check and no automatic dropping of missing values in arrays that are given at a later point, in this case the data required in cov_kwds. If the groups array still has the original set of observations, but some rows have been dropped from the dependent and explanatory variables, then there will be a length mismatch, and it will raise the reported exception.
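One way to avoid the mismatch is to drop the incomplete rows yourself before fitting, so that the model data and the groups array come from the same frame. The following is only a minimal sketch on made-up data: the frame df and the columns y, x and region are hypothetical stand-ins for data_d, the formula variables and BEA_INT.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: 60 rows, 6 clusters, a few missing regressor values.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "y": rng.normal(size=n),
    "x": rng.normal(size=n),
    "region": np.repeat(np.arange(6), 10),  # integer cluster labels
})
df.loc[[3, 17, 42], "x"] = np.nan

# Drop incomplete rows up front, on every column the model or the
# clustering uses, so nothing is dropped silently inside the formula call.
used = df.dropna(subset=["y", "x", "region"])

# Model data and groups now have the same length.
res = smf.ols("y ~ x", data=used).fit(
    cov_type="cluster",
    cov_kwds={"groups": used["region"].to_numpy()},
)
print(res.bse)
```

In the question's code this would amount to building the subset from data_d with every variable that appears in the formula plus BEA_INT, and passing that subset both as data and as the source of groups.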
I reopened https://github.com/statsmodels/statsmodels/issues/1220 because it is possible to handle missing values in the special cases where we have enough information through the pandas indices.
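Until something like that is handled automatically, the index information can be used by hand. The sketch below assumes, as is the case for formula models fitted on a pandas DataFrame, that fittedvalues carries the index of the rows that survived the missing-value drop. The data and the names df, y, x and region are again hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with missing values in the regressor.
rng = np.random.default_rng(1)
n = 60
df = pd.DataFrame({
    "y": rng.normal(size=n),
    "x": rng.normal(size=n),
    "region": np.repeat(np.arange(6), 10),
})
df.loc[[5, 20, 33, 50], "x"] = np.nan

# First fit without clustering, only to learn which rows were kept.
plain = smf.ols("y ~ x", data=df).fit()
kept = plain.fittedvalues.index

# Select the group labels for exactly those rows via the pandas index.
groups = df.loc[kept, "region"].to_numpy()

# Refit with cluster-robust standard errors using the aligned groups.
clustered = smf.ols("y ~ x", data=df).fit(
    cov_type="cluster",
    cov_kwds={"groups": groups},
)
print(clustered.bse)
```

This costs one extra fit but leaves the original DataFrame untouched, which may be convenient when several specifications drop different rows.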