I have panel data from an experiment that looks like this:
account | usage | yearmonth | pre/post | group |
---|---|---|---|---|
1 | 121 | oct 2019 | pre | control |
1 | 124 | nov 2019 | post | control |
2 | 120 | oct 2019 | pre | treatment |
2 | 118 | nov 2019 | post | treatment |
In my data I have about 50 months and a lot more accounts.
I'm using the statsmodels formula API (patsy) to run an OLS regression to evaluate the results.
This isn't the exact model specification I'm using but for the sake of the question:
smf.ols("usage ~ C(group, Treatment('control')) * C(Q('pre/post'), Treatment('pre')) + yearmonth", df).fit()
My question is: when I include the 'yearmonth' variable in my formula, does statsmodels treat it as a dummy variable, or do I need to use pd.get_dummies on it first and then use this model:
smf.ols("usage ~ C(group, Treatment('control')) * C(Q('pre/post'), Treatment('pre')) + Q('oct 2019') + Q('nov 2019')", df).fit()
If I was to use the latter, my formula is going to be super long. So do I need to do it that way?
Thanks!
I believe the default categorical encoding is `Treatment` if the column is a text column. In this case (with an intercept in the model) `Treatment` produces K-1 dummy columns, so one of your `yearmonth` values will be treated as the baseline, and you will see coefficients for all of the other dates except that one. You can see this in detail here.
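As a quick check of the behaviour described above, here is a minimal sketch on made-up toy data (the values and the `2019-10` style month labels are assumptions for illustration, not the asker's data). It fits a formula with a text `yearmonth` column and lists the month coefficients that come out:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data, invented for illustration: 4 accounts observed over 3 months.
df = pd.DataFrame({
    "account": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "usage": [121, 124, 119, 120, 118, 117, 125, 123, 126, 119, 121, 122],
    "yearmonth": ["2019-10", "2019-11", "2019-12"] * 4,  # text column
})

# A string column in the formula is dummy-coded automatically.
model = smf.ols("usage ~ yearmonth", data=df).fit()

# Collect the coefficient names that belong to the yearmonth term.
month_terms = [name for name in model.params.index if name.startswith("yearmonth")]
print(month_terms)
```

With three distinct months and an intercept, only two month coefficients should appear, and the omitted month acts as the baseline. Note that this relies on the column holding strings: if `yearmonth` were stored as a number (say `201910`), the formula would treat it as a continuous variable unless it is wrapped as `C(yearmonth)`.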
`pd.get_dummies` does not act this way by default. It will create a column for every categorical value, meaning you will have one additional column using this method.
If you wish to use `pd.get_dummies`, you would need to set the `drop_first=True` parameter. You can find the documentation here.
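To make the difference concrete, here is a small sketch comparing the two `pd.get_dummies` calls on a made-up series of month labels (the values are assumptions for illustration):

```python
import pandas as pd

# Made-up month labels with K = 3 distinct values.
months = pd.Series(["2019-10", "2019-11", "2019-12", "2019-10"], name="yearmonth")

# Default behaviour: one column per distinct value (K columns).
full = pd.get_dummies(months)

# drop_first=True drops the first level, leaving K-1 columns.
reduced = pd.get_dummies(months, drop_first=True)

print(list(full.columns))     # all three months
print(list(reduced.columns))  # first month dropped, acting as the baseline
```

Using all K columns together with an intercept would make the design matrix perfectly collinear, which is why the reduced form is the one that matches what the formula interface does.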
In short, there's nothing wrong with the first approach, as it is in fact creating dummies behind the scenes. It's just worth noting that it produces K-1, not K, categories.