python statistics data-science statsmodels anova

What is the function of placing a "C" before the categorical variable when performing a one-way ANOVA in statsmodels in Python?


I am performing a one-way ANOVA in the following code:

from statsmodels.formula.api import ols

results = ols('price ~ C(make)', data=df_anova).fit()
print(results.summary())

What is the function of "C" before the categorical variable "make" (which holds 22 car brands)? I don't see anything change when I leave the C out. This webpage (https://pythonfordatascience.org/anova-python/) states that C automatically assigns dummy variables to the categories, excludes one category, and captures it in the intercept so that every other brand is compared against the excluded one. But as I said, the output looks the same when I omit the C in front of the categorical variable.


Solution

  • The formulas in statsmodels are handled by patsy.

    C(x) requires that x is treated as a categorical variable. If the values are strings, then patsy always treats the variable as a categorical variable and C is redundant in that case.

    C forces numeric values like integers to be treated as categorical, which will then be replaced by a dummy or other categorical encoding.

    C is also required to change options from their default values, such as the contrast coding or the reference level.

    https://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.C
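
The points above can be checked by asking patsy which design-matrix columns it builds for each formula. This is only a minimal sketch: the toy data is made up, `doors` is a hypothetical integer-coded column that is not in the original question, and `columns` is a helper introduced here for illustration.

```python
import numpy as np
from patsy import dmatrix

# Hypothetical toy data: `make` holds strings, `doors` holds integers.
data = {
    "make": np.array(["audi", "bmw", "audi", "volvo", "bmw", "volvo"]),
    "doors": np.array([2, 4, 4, 2, 4, 2]),
}

def columns(formula):
    """Return the design-matrix column names patsy builds for a formula."""
    return dmatrix(formula, data).design_info.column_names

# String values: categorical with or without C, so C is redundant here.
make_plain = columns("make")
make_c = columns("C(make)")

# Integer values: one numeric column without C, dummy columns with C.
doors_plain = columns("doors")
doors_c = columns("C(doors)")

# C is needed to change an option, e.g. the reference level of the dummies.
make_ref = columns("C(make, Treatment(reference='volvo'))")

for label, cols in [
    ("make", make_plain),
    ("C(make)", make_c),
    ("doors", doors_plain),
    ("C(doors)", doors_c),
    ("C(make, reference='volvo')", make_ref),
]:
    print(label, "->", cols)
```

For the string column, both formulas give an intercept plus one dummy per non-reference brand, which is why leaving out C changes nothing in the question's output apart from the term labels. For the integer column, only the C version is dummy coded.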