python pandas scikit-learn sklearn-pandas

LabelEncoder().fit_transform vs. pd.get_dummies for categorical coding

It was recently brought to my attention that if you have a dataframe df like this:

   A      B   C
0  0   Boat  45
1  1    NaN  12
2  2    Cat   6
3  3  Moose  21
4  4   Boat  43

You can encode the categorical data automatically with pd.get_dummies:

df1 = pd.get_dummies(df)

Which yields this:

   A   C  B_Boat  B_Cat  B_Moose
0  0  45     1.0    0.0      0.0
1  1  12     0.0    0.0      0.0
2  2   6     0.0    1.0      0.0
3  3  21     0.0    0.0      1.0
4  4  43     1.0    0.0      0.0
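
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame. The values are copied from the table above; storing the missing entry as np.nan is an assumption. Depending on the pandas version, the dummy columns may come back as booleans or integers rather than the 1.0/0.0 floats shown.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; np.nan for the missing value is an assumption.
df = pd.DataFrame({
    "A": [0, 1, 2, 3, 4],
    "B": ["Boat", np.nan, "Cat", "Moose", "Boat"],
    "C": [45, 12, 6, 21, 43],
})

# Only the string column B is expanded; the integer columns A and C pass through.
df1 = pd.get_dummies(df)

# By default the NaN row is all zeros across the B_* columns.
# dummy_na=True adds an extra indicator column for missing values instead.
df2 = pd.get_dummies(df, dummy_na=True)
```

Note that row 1 carries no marker for its missing value in df1, so a NaN is indistinguishable from "none of the known categories" unless dummy_na=True is used.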

I typically run LabelEncoder().fit_transform on such columns before passing them to pd.get_dummies, but if I can skip a step, that would be preferable.

Am I losing anything by simply using pd.get_dummies on my entire dataframe to encode it?

Solution

Yes, you can skip LabelEncoder if you only want to encode string features: pd.get_dummies one-hot encodes string (object) columns directly. On the other hand, if you have a categorical column of integers rather than strings, pd.get_dummies will leave it as it is (see your A or C column, for example). In that case you should use OneHotEncoder. Ideally OneHotEncoder would accept both integers and strings; string support was still being worked on when this was written and arrived in scikit-learn 0.20.
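
To illustrate the integer case, here is a hedged sketch. It assumes column A is meant to be categorical, which the question does not actually state, and the toy frame is made up for illustration. It shows two options: passing columns= to pd.get_dummies, which encodes the named columns regardless of dtype, and scikit-learn's OneHotEncoder.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame: A is an integer-coded category, C is a plain numeric feature.
df = pd.DataFrame({"A": [0, 1, 2, 1, 0], "C": [45, 12, 6, 21, 43]})

# Option 1: name the integer column explicitly so get_dummies encodes it.
dummies = pd.get_dummies(df, columns=["A"])

# Option 2: OneHotEncoder learns the categories during fit and reuses them
# on later transform calls; handle_unknown="ignore" encodes unseen values as all zeros.
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(df[["A"]]).toarray()
```

One practical difference: a fitted OneHotEncoder can be applied to new data and will produce the same columns, whereas pd.get_dummies derives its columns from whatever values appear in the frame it is given.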