For string features where the order doesn't matter, which is better: get_dummies or OneHotEncoder?
For example, on this pandas data frame:
import pandas as pd
df_with_cat = pd.DataFrame({'A': ['ios', 'android', 'web', 'NaN', 'ios', 'ios', 'NaN', 'android'], 'B': [4, 4, 'NaN', 2, 'NaN', 3, 3, 'NaN']})
df_with_cat.head()
A B
---------------
0 ios 4
1 android 4
2 web NaN
3 NaN 2
4 ios NaN
5 ios 3
6 NaN 3
7 android NaN
I know that in order to handle them (impute the missing values, etc.) I first have to encode them, something like this:
from sklearn.preprocessing import LabelEncoder
df_with_cat_orig = df_with_cat.copy()
la_encoder = LabelEncoder()
df_with_cat['A'] = la_encoder.fit_transform(df_with_cat.A)
Output:
df_with_cat.head(10)
A B
-----------
0 2 4
1 1 4
2 3 NaN
3 0 2
4 2 NaN
5 2 3
6 0 3
7 1 NaN
But now it looks as if there is an order 0-3, which is not the case: 'ios' -> 2 is not necessarily greater than 'android' -> 1.
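To see where those numbers come from, here is a minimal sketch that rebuilds the label-to-code mapping from the fitted encoder. The variable names `values`, `encoder` and `mapping` are mine, not from the post; the point is only that the codes follow the sorted order of the labels, so they carry no meaning about the categories themselves.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same values as column 'A' above ('NaN' is a plain string here, not a real missing value)
values = pd.Series(['ios', 'android', 'web', 'NaN', 'ios', 'ios', 'NaN', 'android'])

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the sorted unique labels; each label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

The ordering is purely alphabetical (uppercase before lowercase), which is why 'NaN' ends up as 0 and 'web' as 3.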
I just got an answer to my question above:
When you encode them to numbers and leave them all as a single feature, the model assumes that the order means something, in this case that 'ios' (which is mapped to 2) is greater than 'android' (which is mapped to 1).
If the feature does not have too many categories, it is easy to use get_dummies on it:
data_with_dummies = pd.get_dummies(df_with_cat, columns=['A'], drop_first=True)
B A_1 A_2 A_3
------------------------
0 4 0 1 0
1 4 1 0 0
2 NaN 0 0 1
3 2 0 0 0
4 NaN 0 1 0
5 3 0 1 0
6 3 0 0 0
7 NaN 1 0 0
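As a side note, get_dummies also accepts the original string column directly, so the LabelEncoder step is not strictly needed for this. A small sketch, assuming a fresh frame `df` with only column 'A' (my own name, not from the post) and `dtype=int` so the result shows 0/1 rather than booleans on newer pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'A': ['ios', 'android', 'web', 'NaN', 'ios', 'ios', 'NaN', 'android']})

# drop_first=True drops the first category in sorted order (the string 'NaN' here),
# so rows with that value are encoded as all zeros
dummies = pd.get_dummies(df, columns=['A'], drop_first=True, dtype=int)
print(dummies)
```

The resulting columns are named after the categories (A_android, A_ios, A_web) instead of the integer codes, which makes the output easier to read.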
Now we avoid the problem I stated to begin with, which should help any model that would otherwise treat the integer codes as ordered.
Or just use OneHotEncoder - as @Primusa stated in the answer above
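For comparison, a minimal sketch of the OneHotEncoder route on the same column. The names `df`, `encoder` and `one_hot` are illustrative, and `handle_unknown='ignore'` is my own choice rather than something from the answer. The practical difference from get_dummies is that the encoder remembers the categories it saw during fit, so it can be applied consistently to new data later.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'A': ['ios', 'android', 'web', 'NaN', 'ios', 'ios', 'NaN', 'android']})

# handle_unknown='ignore' encodes categories not seen during fit as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')

# fit_transform returns a sparse matrix by default; convert it for display
encoded = encoder.fit_transform(df[['A']]).toarray()

# Build readable column names from the categories learned during fit
columns = [f"A_{category}" for category in encoder.categories_[0]]
one_hot = pd.DataFrame(encoded, columns=columns, index=df.index)
print(one_hot)

# A value the encoder has never seen becomes a row of zeros instead of raising an error
unseen = encoder.transform(pd.DataFrame({'A': ['linux']})).toarray()
```

If the encoding is only needed once for exploration, get_dummies is shorter; if the same encoding has to be reused on a test set or inside a pipeline, the fitted encoder is usually the more convenient option.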