Tags: python, linear-regression, statsmodels, dummy-variable, panel-data

Is pd.get_dummies the same as simply including a categorical variable in a statsmodels OLS formula?


I have panel data from an experiment that looks like this:

account  usage  yearmonth  pre/post  group
1        121    Oct 2019   pre       control
1        124    Nov 2019   post      control
2        120    Oct 2019   pre       treatment
2        118    Nov 2019   post      treatment

In my data I have about 50 months and a lot more accounts.

I'm using the statsmodels formula API (patsy) to run an OLS regression to evaluate the results.

This isn't the exact model specification I'm using, but for the sake of the question:

smf.ols("usage ~ C(group, Treatment('control')) * C(Q('pre/post'), Treatment('pre')) + yearmonth", df).fit()

My question is: when I include the yearmonth variable in my formula, does statsmodels treat it as a set of dummy variables, or do I need to run pd.get_dummies on it first and then use a model like this:

 smf.ols("usage ~ C(group, Treatment('control')) * C(Q('pre/post'), Treatment('pre')) + Q('Oct 2019') + Q('Nov 2019')", df).fit()

If I were to use the latter, my formula would be super long. Do I really need to do it that way?
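For concreteness, I assume the manual route would look roughly like the sketch below (the 'ym' prefix and the programmatic formula building are just how I imagine it, not anything I've settled on):

    import pandas as pd

    # Toy version of the panel above -- the real data has ~50 months
    df = pd.DataFrame({
        'account':   [1, 1, 2, 2],
        'usage':     [121, 124, 120, 118],
        'yearmonth': ['Oct 2019', 'Nov 2019', 'Oct 2019', 'Nov 2019'],
        'pre/post':  ['pre', 'post', 'pre', 'post'],
        'group':     ['control', 'control', 'treatment', 'treatment'],
    })

    # One 0/1 column per month; drop_first=True keeps one month as the baseline
    dummies = pd.get_dummies(df['yearmonth'], prefix='ym', drop_first=True)
    df_expanded = pd.concat([df.drop(columns='yearmonth'), dummies], axis=1)

    # Build the (very long) list of month terms programmatically; Q() is needed
    # because the generated column names contain spaces, e.g. 'ym_Oct 2019'
    month_terms = ' + '.join(f"Q('{c}')" for c in dummies.columns)
    formula = ("usage ~ C(group, Treatment('control')) * "
               "C(Q('pre/post'), Treatment('pre')) + " + month_terms)
    print(formula)
    # On the real data I would then run smf.ols(formula, df_expanded).fit()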

Thanks!


Solution

  • I believe the default categorical encoding is treatment coding when the column is a text column. In that case treatment coding produces K-1 dummy variables, so one of your yearmonth values is taken as the baseline and you will see coefficients for every other date except that one. The patsy documentation on categorical treatment coding covers this in detail.

    pd.get_dummies does not behave this way by default. It creates a column for every categorical value, meaning you will end up with one additional column using this method.

    If you wanted to use pd.get_dummies you would need to set the drop_first=True parameter; see the pandas documentation for pd.get_dummies.

    In short, there's nothing wrong with the first approach, as it is in fact creating dummies behind the scenes; it's just worth noting that it gives you K-1, not K, categories. The short sketch below illustrates the difference.
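    To make the difference concrete, here is a minimal sketch on a made-up DataFrame (the output comments are what I would expect from patsy's default treatment coding and from pd.get_dummies):

        import pandas as pd
        import patsy

        df = pd.DataFrame({
            'usage':     [121, 124, 120, 118, 119, 122],
            'yearmonth': ['Oct 2019', 'Nov 2019', 'Oct 2019',
                          'Nov 2019', 'Dec 2019', 'Dec 2019'],
        })

        # Formula interface: a text column is treated as categorical and
        # treatment-coded, so you get K-1 dummy columns plus an intercept
        X = patsy.dmatrix('yearmonth', df, return_type='dataframe')
        print(X.columns.tolist())
        # ['Intercept', 'yearmonth[T.Nov 2019]', 'yearmonth[T.Oct 2019]']

        # pd.get_dummies by default creates K columns, one per level ...
        print(pd.get_dummies(df['yearmonth']).columns.tolist())
        # ['Dec 2019', 'Nov 2019', 'Oct 2019']

        # ... so to match the formula behaviour, pass drop_first=True
        print(pd.get_dummies(df['yearmonth'], drop_first=True).columns.tolist())
        # ['Nov 2019', 'Oct 2019']

    Either way should give the same fit once the baseline level matches; the formula interface just saves you from managing the dummy columns yourself.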