Tags: python, linear-regression, statsmodels, dummy-variable, panel-data

Is pd.get_dummies the same as simply including a categorical variable in a statsmodels OLS formula?


I have panel data from an experiment that looks like this:

account  usage  yearmonth  pre/post  group
1        121    Oct 2019   pre       control
1        124    Nov 2019   post      control
2        120    Oct 2019   pre       treatment
2        118    Nov 2019   post      treatment

In my data I have about 50 months and a lot more accounts.

I'm using the statsmodels formula API (patsy) to run an OLS regression to evaluate the results.

This isn't the exact model specification I'm using, but for the sake of the question:

smf.ols("usage ~ C(group, Treatment('control')) * C(Q('pre/post'), Treatment('pre')) + yearmonth", df).fit()

My question is: when I include the yearmonth variable in my formula, does statsmodels treat it as a set of dummy variables, or do I need to run pd.get_dummies on it first and then use a model like this:

 smf.ols("usage ~ C(group, Treatment('control')) * C(Q('pre/post'), Treatment('pre')) + Q('Oct 2019') + Q('Nov 2019')", df).fit()

If I were to use the latter, my formula would be super long. Do I really need to do it that way?
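For concreteness, I assume the manual route would look roughly like the sketch below (the 'ym' prefix and the programmatic formula building are just how I imagine it, not anything I've settled on):

    import pandas as pd

    # Toy version of the panel above -- the real data has ~50 months
    df = pd.DataFrame({
        'account':   [1, 1, 2, 2],
        'usage':     [121, 124, 120, 118],
        'yearmonth': ['Oct 2019', 'Nov 2019', 'Oct 2019', 'Nov 2019'],
        'pre/post':  ['pre', 'post', 'pre', 'post'],
        'group':     ['control', 'control', 'treatment', 'treatment'],
    })

    # One 0/1 column per month; drop_first=True keeps one month as the baseline
    dummies = pd.get_dummies(df['yearmonth'], prefix='ym', drop_first=True)
    df_expanded = pd.concat([df.drop(columns='yearmonth'), dummies], axis=1)

    # Build the (very long) list of month terms programmatically; Q() is needed
    # because the generated column names contain spaces, e.g. 'ym_Oct 2019'
    month_terms = ' + '.join(f"Q('{c}')" for c in dummies.columns)
    formula = ("usage ~ C(group, Treatment('control')) * "
               "C(Q('pre/post'), Treatment('pre')) + " + month_terms)
    print(formula)
    # On the real data I would then run smf.ols(formula, df_expanded).fit()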

Thanks!


Solution

  • I believe the default categorical encoding is treatment coding when the column is a text column. In that case treatment coding produces K-1 dummy variables, so one of your yearmonth values is taken as the baseline and you will see coefficients for every other date except that one. The patsy documentation on categorical treatment coding covers this in detail.

    pd.get_dummies does not behave this way by default. It creates a column for every categorical value, meaning you will end up with one additional column using this method.

    If you wanted to use pd.get_dummies you would need to set the drop_first=True parameter; see the pandas documentation for pd.get_dummies.

    In short, there's nothing wrong with the first approach, as it is in fact creating dummies behind the scenes; it's just worth noting that it gives you K-1, not K, categories. The short sketch below illustrates the difference.
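    To make the difference concrete, here is a minimal sketch on a made-up DataFrame (the output comments are what I would expect from patsy's default treatment coding and from pd.get_dummies):

        import pandas as pd
        import patsy

        df = pd.DataFrame({
            'usage':     [121, 124, 120, 118, 119, 122],
            'yearmonth': ['Oct 2019', 'Nov 2019', 'Oct 2019',
                          'Nov 2019', 'Dec 2019', 'Dec 2019'],
        })

        # Formula interface: a text column is treated as categorical and
        # treatment-coded, so you get K-1 dummy columns plus an intercept
        X = patsy.dmatrix('yearmonth', df, return_type='dataframe')
        print(X.columns.tolist())
        # ['Intercept', 'yearmonth[T.Nov 2019]', 'yearmonth[T.Oct 2019]']

        # pd.get_dummies by default creates K columns, one per level ...
        print(pd.get_dummies(df['yearmonth']).columns.tolist())
        # ['Dec 2019', 'Nov 2019', 'Oct 2019']

        # ... so to match the formula behaviour, pass drop_first=True
        print(pd.get_dummies(df['yearmonth'], drop_first=True).columns.tolist())
        # ['Nov 2019', 'Oct 2019']

    Either way should give the same fit once the baseline level matches; the formula interface just saves you from managing the dummy columns yourself.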