python pandas data-science dummy-variable

Dummy variable trap: does it matter which dummy column I delete?

I've just learned about dummy variables and the dummy variable trap. Suppose I have a categorical column with 3 categories in it, for example:

Dog
Cat
Bear

I split it into 3 separate columns, IsDog, IsCat and IsBear, each holding 0/1, so I can use them in my model. But I've read that the number of dummy columns should always be (number_of_categories - 1). So should I delete the last one (in this case IsBear), or does it not actually matter, so that I can drop any one of them?

Solution

You can have Pandas do it automatically for you, for each categorical column, as follows.

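Note that get_dummies names each new column after the original column plus the category value (e.g. if your column were called 'Animal', with the categories Bear, Cat and Dog, you would get Animal_Bear, Animal_Cat and Animal_Dog), and it drops the original column ('Animal'). With drop_first=True it gives you k-1 dummy columns for a column with k categories by dropping the first category in sorted order, so here you would be left with Animal_Cat and Animal_Dog: 2 dummy columns, not 3, as per your question. As for which one to delete: for a linear model with an intercept it does not change the fitted predictions. The dropped category simply becomes the baseline that the other coefficients are measured against.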

import pandas as pd
df = pd.get_dummies(df, columns=['cat_var_1', 'cat_var_2'], drop_first=True)
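
To make this concrete with the Dog/Cat/Bear example from the question, here is a small sketch. The column names "Animal" and "weight" and the numbers are made up for illustration, and the least-squares fit via numpy is only a stand-in for whatever model you actually use. It shows which column drop_first removes, how to drop a category of your own choosing instead, and that both choices give the same fitted values in a plain linear fit with an intercept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "Animal" and "weight" are made-up names for illustration.
df = pd.DataFrame({
    "Animal": ["Dog", "Cat", "Bear", "Dog", "Cat", "Bear"],
    "weight": [30.0, 4.0, 300.0, 34.0, 5.0, 280.0],
})

# drop_first=True drops the first category in sorted order ("Bear" here).
auto = pd.get_dummies(df, columns=["Animal"], drop_first=True)
print(list(auto.columns))  # ['weight', 'Animal_Cat', 'Animal_Dog']

# To pick the baseline yourself, build all k dummies and drop one by name.
full = pd.get_dummies(df, columns=["Animal"])
manual = full.drop(columns=["Animal_Dog"])
print(list(manual.columns))  # ['weight', 'Animal_Bear', 'Animal_Cat']


def fitted_values(frame):
    # Least-squares fit of weight on an intercept plus the dummy columns.
    X = frame.drop(columns=["weight"]).to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(X)), X])
    y = frame["weight"].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef


# Both baselines give the same fitted values; only the coefficients differ.
print(np.allclose(fitted_values(auto), fitted_values(manual)))  # True
```

In this sketch the only thing that changes between the two versions is how you read the coefficients: each one is the difference from whichever category was dropped. This equivalence is for an unregularized linear fit with an intercept. With a penalized model the choice of baseline can affect the result, so treat this as an illustration rather than a general guarantee.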