python, pandas, scikit-learn, sklearn-pandas

LabelEncoder().fit_transform vs. pd.get_dummies for categorical coding


It was recently brought to my attention that if you have a dataframe df like this:

   A      B   C
0  0   Boat  45
1  1    NaN  12
2  2    Cat   6
3  3  Moose  21
4  4   Boat  43

You can encode the categorical data automatically with pd.get_dummies:

df1 = pd.get_dummies(df)

Which yields this:

   A   C  B_Boat  B_Cat  B_Moose
0  0  45     1.0    0.0      0.0
1  1  12     0.0    0.0      0.0
2  2   6     0.0    1.0      0.0
3  3  21     0.0    0.0      1.0
4  4  43     1.0    0.0      0.0
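
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame and checks what happens to the missing value. The column values are copied from the tables above. The exact dtype of the indicator columns (float, integer or boolean) depends on the pandas version, so it may not match the 1.0/0.0 shown here. The dummy_na=True call is an addition for illustration and is not something the original snippet used.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; the NaN in row 1 of B is a missing value.
df = pd.DataFrame({
    "A": [0, 1, 2, 3, 4],
    "B": ["Boat", np.nan, "Cat", "Moose", "Boat"],
    "C": [45, 12, 6, 21, 43],
})

# Only the string column B is expanded; A and C pass through unchanged.
df1 = pd.get_dummies(df)
print(df1.columns.tolist())

# The row with the missing value has no indicator set in any B_* column.
print(df1.loc[1, ["B_Boat", "B_Cat", "B_Moose"]].sum())

# dummy_na=True adds an extra indicator column for missing values,
# so the NaN row is no longer encoded as all zeros.
df2 = pd.get_dummies(df, dummy_na=True)
print(df2.columns.tolist())
```

By default the NaN row is encoded as all zeros, which makes it indistinguishable from "none of the known categories". Whether that matters depends on the model being fit.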

I typically run LabelEncoder().fit_transform on this sort of data before passing it to pd.get_dummies, but if I can skip a step, that would be preferable.

Am I losing anything by simply using pd.get_dummies on my entire dataframe to encode it?


Solution

  • Yes, you can skip LabelEncoder if you only want to encode string features. On the other hand, if you have a categorical column of integers (instead of strings), pd.get_dummies will leave it as it is (see your A or C column, for example). In that case you should use OneHotEncoder. Ideally OneHotEncoder would support both integers and strings, but that is still being worked on at the moment.
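
The integer-column behaviour described above can be seen in a short sketch. Treating column A as categorical is purely an assumption for illustration. Passing columns= to pd.get_dummies is shown here as a pandas-only alternative to the OneHotEncoder route the answer recommends, not as a replacement for it.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0, 1, 2, 3, 4],
    "B": ["Boat", None, "Cat", "Moose", "Boat"],
    "C": [45, 12, 6, 21, 43],
})

# Default call: integer columns A and C are left untouched.
default = pd.get_dummies(df)
print(default.columns.tolist())

# Naming the columns explicitly forces A to be one-hot encoded too,
# even though its dtype is integer. C is still passed through.
forced = pd.get_dummies(df, columns=["A", "B"])
print(forced.columns.tolist())
```

With columns= given, pandas encodes exactly the listed columns regardless of dtype, so numeric features such as C have to be left out of the list deliberately.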