python linear-regression statsmodels categorical-data

Specifying which category to treat as the base with 'statsmodels'


I understand that when a model passed to a statsmodels fit contains a categorical variable, dummy variables are automatically generated for its categories. For example, if I have a variable 'Location' with values 'IndianOcean', 'Thailand', 'China' and 'Mars', I will get variables in my model of the form

Location[T.Thailand]

with one of the values not represented. By default, the excluded value seems to be the least common one. Is there a way to specify, ideally within the model specification, which value is treated as the "base value" and excluded?


Solution

  • You can pass a reference arg to the Treatment contrast, using syntax like

    "y ~ C(Location, Treatment(reference='China'))"

    http://patsy.readthedocs.org/en/latest/API-reference.html#patsy.Treatment

    If you have a better suggestion for naming conventions, please file an issue with patsy.
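As a minimal sketch of the formula syntax above in a full fit: the data frame, the response column y, and its random values are made up for illustration, and 'Mars' is chosen as the reference only to show that any level name can be used in place of 'China'.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: the 'Location' levels come from the question,
# while the response 'y' is random noise invented for this example.
rng = np.random.default_rng(0)
locations = ["IndianOcean", "Thailand", "China", "Mars"]
df = pd.DataFrame({"Location": np.repeat(locations, 5)})
df["y"] = rng.normal(size=len(df))

# Name the base level explicitly inside the formula.
formula = "y ~ C(Location, Treatment(reference='Mars'))"
fit = smf.ols(formula, data=df).fit()

# The fitted parameters are the intercept plus one dummy per non-base level.
terms = list(fit.params.index)
print(terms)
```

The printed parameter names should contain a [T.<level>] entry for every level except the one given as reference, which is absorbed into the intercept.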