It was recently brought to my attention that if you have a dataframe df
like this:
   A      B   C
0  0   Boat  45
1  1    NaN  12
2  2    Cat   6
3  3  Moose  21
4  4   Boat  43
You can encode the categorical data automatically with pd.get_dummies:
df1 = pd.get_dummies(df)
Which yields this:
   A   C  B_Boat  B_Cat  B_Moose
0  0  45     1.0    0.0      0.0
1  1  12     0.0    0.0      0.0
2  2   6     0.0    1.0      0.0
3  3  21     0.0    0.0      1.0
4  4  43     1.0    0.0      0.0
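For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame above and calls pd.get_dummies on it. The column values are taken from the tables; the dummy_na=True call is an extra I added to show one way the missing value in row 1 could be kept as its own indicator column instead of an all-zero row. The dtype of the dummy columns (float, integer, or boolean) depends on the pandas version, so the printed values may not match the table exactly.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the tables above.
df = pd.DataFrame({
    "A": [0, 1, 2, 3, 4],
    "B": ["Boat", np.nan, "Cat", "Moose", "Boat"],
    "C": [45, 12, 6, 21, 43],
})

# Only the string column B is expanded; the numeric columns A and C
# pass through unchanged.
df1 = pd.get_dummies(df)
print(df1)

# Assumption for illustration: dummy_na=True adds one more indicator
# column that flags the missing value in B.
df2 = pd.get_dummies(df, dummy_na=True)
print(df2)
```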
I typically use LabelEncoder().fit_transform for this sort of task before passing the result to pd.get_dummies, but if I can skip a few steps that would be desirable.
Am I losing anything by simply using pd.get_dummies on my entire dataframe to encode it?
Yes, you can skip LabelEncoder if you only want to encode string features. On the other hand, if you have a categorical column of integers (instead of strings), pd.get_dummies will leave it as it is (see your A or C column, for example). In that case you should use OneHotEncoder. Ideally OneHotEncoder would support both integers and strings, but this is being worked on at the moment.
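To make the integer case concrete, here is a small sketch. The frame and its column names (size, kind) are hypothetical and not from the question. It shows the default behaviour, then two ways an integer-coded category might be expanded: passing the columns argument to pd.get_dummies, which is an alternative I am adding rather than something suggested above, and OneHotEncoder applied to the integer column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame: "size" is an integer-coded category, "kind" is a string.
df = pd.DataFrame({
    "size": [1, 2, 3, 1],
    "kind": ["Boat", "Cat", "Moose", "Boat"],
})

# Default behaviour: the integer column is left untouched,
# only the string column is expanded.
default = pd.get_dummies(df)
print(default.columns.tolist())

# Naming the columns explicitly makes get_dummies expand the integer column too.
forced = pd.get_dummies(df, columns=["size", "kind"])
print(forced.columns.tolist())

# OneHotEncoder treats every input column as categorical, so the integer
# codes get one output column per distinct value. The result is sparse
# by default, hence the toarray() call.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["size"]]).toarray()
print(encoded.shape)
```

One practical difference is that a fitted OneHotEncoder remembers the categories it saw and can be reused on new data, whereas pd.get_dummies derives the columns from whatever frame it is given.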