python statsmodels fstat

statsmodels patsy hypothesis testing categorical variable in constraint 'C()'


Hi, I am running the following model with statsmodels, and it works fine.

from statsmodels.formula.api import ols
from statsmodels.iolib.summary2 import summary_col  # for summary stats of large tables

# df is a pandas DataFrame that holds the columns used in the formula
time_FE_str = ' + C(hour_of_day) + C(day_of_week) + C(week_of_year)'
weather_2_str = ' + C(weather_index) + rain + extreme_temperature + wind_speed'
model = ols("activity_count ~ C(city_id)" + weather_2_str + time_FE_str, data=df)
results = model.fit()
print(summary_col(results).tables)

print('F-TEST:')
# The constraint string below is the one that raises the PatsyError
hypotheses = '(C(weather_index) = 0), (rain=0), (extreme_temperature=0), (wind_speed=0)'
f_test = results.f_test(hypotheses)

However, I do not know how to formulate the hypothesis for the F-test if I want to include the categorical variable C(weather_index). I tried every version I could think of, but I always get an error.

Has anyone faced this issue before?

Any ideas?

F-TEST:
Traceback (most recent call last):
  File "C:/VK/scripts_python/predict_activity.py", line 95, in <module>
    f_test = results.f_test(hypotheses)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\statsmodels\base\model.py", line 1375, in f_test
    invcov=invcov, use_f=True)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\statsmodels\base\model.py", line 1437, in wald_test
    LC = DesignInfo(names).linear_constraint(r_matrix)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\design_info.py", line 536, in linear_constraint
    return linear_constraint(constraint_likes, self.column_names)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\constraint.py", line 391, in linear_constraint
    tree = parse_constraint(code, variable_names)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\constraint.py", line 225, in parse_constraint
    return infix_parse(_tokenize_constraint(string, variable_names),
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\constraint.py", line 184, in _tokenize_constraint
    Origin(string, offset, offset + 1))
patsy.PatsyError: unrecognized token in constraint
   (C(weather_index) = 0), (rain=0), (extreme_temperature=0), (wind_speed=0)
    ^

Solution

  • The methods t_test, wald_test and f_test test hypotheses on the parameters directly, not on an entire categorical or composite effect. C(weather_index) is a term in the formula, not a parameter name, which is why the constraint parser rejects it.

    results.summary() shows the parameter names that patsy created for the categorical variables. Those names can be used to create contrasts or restrictions for the categorical effects.

    As an alternative, anova_lm directly computes the hypothesis test that a term, e.g. a categorical variable, has no effect.
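Both routes described above can be sketched as follows. This is a minimal illustration, not the asker's actual setup: the DataFrame is synthetic, the formula is a shortened version of the one in the question, and the helper names `tested` and `R` are made up for the example. It builds a restriction matrix that selects every dummy parameter generated for C(weather_index) together with the continuous weather variables, passes it to f_test, and then runs anova_lm for the per-term tests.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in for the asker's df (hypothetical values, fewer columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "weather_index": rng.integers(0, 4, size=n),
    "rain": rng.normal(size=n),
    "wind_speed": rng.normal(size=n),
})
df["activity_count"] = (
    2.0 + 0.5 * df["weather_index"] + 0.3 * df["rain"] + rng.normal(size=n)
)

results = ols(
    "activity_count ~ C(weather_index) + rain + wind_speed", data=df
).fit()

# Parameter names as created by the formula machinery,
# e.g. 'C(weather_index)[T.1]'; these are also what summary() displays.
names = list(results.params.index)
print(names)

# Route 1: joint F-test via a restriction matrix with one row per tested
# parameter. Each row picks out a single coefficient and tests it against 0.
tested = [
    i for i, name in enumerate(names)
    if name.startswith("C(weather_index)") or name in ("rain", "wind_speed")
]
R = np.zeros((len(tested), len(names)))
R[np.arange(len(tested)), tested] = 1.0
f_test = results.f_test(R)
print(f_test)

# Route 2: anova_lm reports one test per term, so the categorical
# variable appears as a single row labelled 'C(weather_index)'.
anova_table = anova_lm(results, typ=2)
print(anova_table)
```

Selecting columns by name prefix avoids typing each dummy name by hand. With a patsy-based statsmodels install, writing the full parameter names into a constraint string (for example 'C(weather_index)[T.1] = 0, rain = 0') should also be accepted, but that variant is an assumption and is not exercised here.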