machine-learning  dummy-variable  one-hot-encoding

Do I need to handle the dummy variable trap manually in regression, or will sklearn do it?


I know that we have to one-hot encode categorical data before training a machine learning algorithm. My question is: do we need to remove one column manually, or will sklearn do it for us?


Solution

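  • I assume you want to drop one column for every categorical feature, including non-binary ones, to avoid multicollinearity, which can cause problems for linear models. Neither pandas nor sklearn drops a column by default, so you have to ask for it. With pandas it is as easy as passing drop_first=True to pd.get_dummies(). When this answer was written, sklearn.preprocessing.OneHotEncoder had no simple interface for this, and it was awkward to use because categorical features had to be encoded as ints beforehand. As far as I know, newer releases accept string categories directly (0.20 and later) and offer a drop='first' parameter (0.21 and later).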
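
To make the two options concrete, here is a minimal sketch rather than a definitive recipe. The toy DataFrame and its "color" column are made up for illustration, and the OneHotEncoder call assumes scikit-learn 0.21 or later, where the drop parameter is available.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical feature with three levels.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# pandas: drop_first=True turns k levels into k-1 columns.
# The first level in sorted order ("blue") becomes the implicit baseline.
dummies = pd.get_dummies(df, columns=["color"], drop_first=True)
dummy_columns = list(dummies.columns)

# scikit-learn: nothing is dropped by default (drop=None).
# drop="first" (assumes scikit-learn >= 0.21) removes the first category
# of each feature; string categories are accepted directly.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(df[["color"]]).toarray()

print(dummy_columns)
print(encoded)
```

In both cases the dropped level is represented by a row of all zeros, so no information is lost, and the remaining columns are no longer perfectly collinear with the intercept.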