python pandas data-science dummy-variable

Dummy variable trap, does it matter which dummy column I delete?


I've just learned about dummy variables and about the dummy variable trap. So let's assume I have a categorical column with 3 categories in it, for example:

Dog
Cat
Bear

I split it into 3 separate columns, IsDog, IsCat and IsBear, each holding 0/1, so I can use them in my model. But I've read that the number of dummy columns should always be (number_of_categories - 1). So should I delete the last one (in this case IsBear), or does it not actually matter, so I can just drop any one of them?


Solution

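  • You can have pandas do it automatically for you, for each categorical column, as follows.

    Note that pandas prefixes each new column with the original column name followed by the category value (e.g. a categorical column 'Dog' whose 3 categories are coded 0, 1 and 2 gives two new columns, Dog_1 and Dog_2), and it drops the original column ('Dog'). With drop_first=True it gives you k-1 dummy columns for a column with k categories, removing the dummy for the first category level (so the column Dog with 3 categories becomes 2 dummy columns, not 3, as per your question). Which dummy you drop does not change the fit of an ordinary linear model with an intercept; it only changes which category is the baseline that the other coefficients are compared against.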

    df = pd.get_dummies(df, columns=['cat_var_1', 'cat_var_2'], drop_first=True)
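
To make this concrete with the Dog/Cat/Bear example from the question, here is a minimal sketch. The column name "Animal" and the toy rows are assumptions made for illustration, not something from the original post. It shows which column drop_first=True removes, and one way to pick the dropped (baseline) category yourself.

```python
import pandas as pd

# Hypothetical toy data; the column name "Animal" is an assumption for illustration.
df = pd.DataFrame({"Animal": ["Dog", "Cat", "Bear", "Dog"]})

# drop_first=True removes the dummy for the first category level.
# For string data the levels are sorted, so "Bear" is the one dropped here.
dropped_first = pd.get_dummies(df, columns=["Animal"], drop_first=True)
print(list(dropped_first.columns))

# To choose the baseline yourself, build all k dummies and drop one explicitly.
all_dummies = pd.get_dummies(df, columns=["Animal"])
dog_baseline = all_dummies.drop(columns=["Animal_Dog"])
print(list(dog_baseline.columns))
```

Either result has k-1 = 2 dummy columns; a row with zeros in both remaining columns belongs to the dropped category.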