I am performing a one-way ANOVA in the following code:
from statsmodels.formula.api import ols

results = ols('price ~ C(make)', data=df_anova).fit()
print(results.summary())
What is the function of "C" before the categorical variable "make" (which holds 22 car brands)? Nothing seems to change when I leave the C out. This webpage (https://pythonfordatascience.org/anova-python/) states that it automatically assigns dummy variables to the categories, excludes one category, and captures it in the intercept so that comparisons are made relative to the excluded brand. But, as noted above, the results look the same when I omit the C in front of the categorical variable.
The formulas in statsmodels are handled by patsy.
C(x) requires that x be treated as a categorical variable. If the values are strings, patsy always treats the variable as categorical, so C is redundant in that case.
C forces numeric values such as integers to be treated as categorical; they are then replaced by dummy variables or another categorical encoding.
C is also required to change options, such as the contrast coding or the reference level, from their default values.
https://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.C
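The three cases above can be checked directly with patsy. The following is a minimal sketch on made-up toy data, not the data set from the question: the "doors" column and its values are assumptions added for illustration, and a plain dict stands in for the DataFrame.

```python
import numpy as np
from patsy import dmatrix

# Hypothetical toy data: "make" holds strings, "doors" holds integers.
data = {
    "make": ["audi", "bmw", "volvo", "audi", "bmw", "volvo"],
    "doors": [2, 4, 5, 4, 2, 5],
}

# Case 1: a string column is dummy-coded with or without C().
plain = dmatrix("make", data)
wrapped = dmatrix("C(make)", data)
same_for_strings = np.array_equal(np.asarray(plain), np.asarray(wrapped))

# Case 2: an integer column is a single numeric regressor by default,
# but C() expands it into one dummy column per non-reference level.
numeric_cols = dmatrix("doors", data).design_info.column_names
categorical_cols = dmatrix("C(doors)", data).design_info.column_names

# Case 3: C() is needed to pass options, here a different reference level.
rebased = dmatrix("C(make, Treatment(reference='bmw'))", data)
rebased_cols = rebased.design_info.column_names

print("identical design matrix for strings:", same_for_strings)
print("doors as numeric:    ", numeric_cols)
print("doors as categorical:", categorical_cols)
print("make with bmw as reference:", rebased_cols)
```

Only the column labels differ between "make" and "C(make)"; the dummy columns themselves are the same, which is why the fitted model in the question does not change when C is left out.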