Tags: python-3.x, pandas, machine-learning, scikit-learn, feature-engineering

In machine learning, what is the best way to encode non-hierarchical categorical features?


For string features where the order doesn't matter, which is better: get_dummies or OneHotEncoder?

For example, on this pandas data frame:

import pandas as pd

df_with_cat = pd.DataFrame({'A': ['ios', 'android', 'web', 'NaN', 'ios', 'ios', 'NaN', 'android'], 'B': [4, 4, 'NaN', 2, 'NaN', 3, 3, 'NaN']})

df_with_cat.head(10)

    A        B
---------------
0   ios      4
1   android  4
2   web      NaN
3   NaN      2
4   ios      NaN
5   ios      3
6   NaN      3
7   android  NaN

I know that in order to handle these features (impute the missing values, etc.) I first have to encode them, something like this:

from sklearn.preprocessing import LabelEncoder

df_with_cat_orig = df_with_cat.copy()
la_encoder = LabelEncoder()
df_with_cat['A'] = la_encoder.fit_transform(df_with_cat.A)

Output:

df_with_cat.head(10)

    A   B
-----------
0   2   4
1   1   4
2   3   NaN
3   0   2
4   2   NaN
5   2   3
6   0   3
7   1   NaN

But now it looks as if there is some order 0-3, which is not the case: 'ios' -> 2 is not necessarily greater than 'android' -> 1.
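To see where the 0-3 numbering comes from, it can help to inspect the encoder's classes_ attribute. The following is a minimal sketch on the same values as column A; the names encoder, codes and mapping are only illustrative. LabelEncoder sorts the labels and assigns each one its position in the sorted list, so the integers reflect sort order only, not any property of the platforms.

```python
from sklearn.preprocessing import LabelEncoder

# Same values as column 'A' in the question; 'NaN' here is a plain string.
values = ['ios', 'android', 'web', 'NaN', 'ios', 'ios', 'NaN', 'android']

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the sorted labels; a label's index in it is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(list(codes))
```

The uppercase 'N' sorts before the lowercase letters, which is why the 'NaN' string ends up with code 0.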


Solution

  • I just got an answer to my question above. It relates to this passage from the question:

    "But now it seems like there is some order 0-3 but this is not the case... 'ios' ->2 is not necessarily greater than 'android' ->1"

    When you encode the categories as numbers and leave them all as a single feature, the model assumes that the order means something, i.e. that 'ios' (which is mapped to 2) is greater than 'android' (which is mapped to 1).

    If the feature does not have too many categories, it is easy to apply get_dummies to it:

    data_with_dummies = pd.get_dummies(df_with_cat, columns=['A'], drop_first=True)
    
    
        B    A_1  A_2  A_3
    ------------------------
    0   4    0    1    0
    1   4    1    0    0
    2   NaN  0    0    1
    3   2    0    0    0
    4   NaN  0    1    0
    5   3    0    1    0
    6   3    0    0    0
    7   NaN  1    0    0
    

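    This avoids the problem I stated at the beginning: each category gets its own 0/1 column, so no artificial order is imposed. With drop_first=True the first category (code 0, the 'NaN' string) is dropped and is represented by zeros in all three columns. For models that interpret feature values as magnitudes, this should improve performance.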

    Or just use OneHotEncoder, as @Primusa stated in the answer above.