python pandas scikit-learn classification one-hot-encoding

How to use the output from OneHotEncoder in sklearn?


I have a pandas DataFrame with two categorical variables, an ID variable, and a target variable (for classification). I managed to convert the categorical values with OneHotEncoder, which gives me a sparse matrix.

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
# First I remapped the string values in the categorical variables to integers as OneHotEncoder needs integers as input
... remapping code ...

ohe.fit(df[['col_a', 'col_b']])
encoded = ohe.transform(df[['col_a', 'col_b']])  # sparse matrix

But I have no clue how to use this sparse matrix in a DecisionTreeClassifier, especially since I want to add some other non-categorical variables from my dataframe later on. Thanks!

EDIT: In reply to miraculixx's comment, I also tried the DataFrameMapper from sklearn-pandas:

from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    ('id_col', None),
    ('target_col', None),
    (['col_a'], OneHotEncoder()),
    (['col_b'], OneHotEncoder())
])

t = mapper.fit_transform(df)

But then I get this error:

TypeError: no supported conversion for types : (dtype('O'), dtype('int64'), dtype('float64'), dtype('float64')).


Solution

  • I see you are already using pandas, so why not use its get_dummies function?

    import pandas as pd
    df = pd.DataFrame([['rick','young'],['phil','old'],['john','teenager']],columns=['name','age-group'])
    

    result

       name age-group
    0  rick     young
    1  phil       old
    2  john  teenager
    

    Now encode it with get_dummies:

    pd.get_dummies(df)
    

    result

       name_john  name_phil  name_rick  age-group_old  age-group_teenager  \
    0          0          0          1              0                   0   
    1          0          1          0              1                   0   
    2          1          0          0              0                   1   
    
       age-group_young  
    0                1  
    1                0  
    2                0
    

    You can then pass the new pandas DataFrame directly to scikit-learn's DecisionTreeClassifier.
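
To make that last step concrete, here is a minimal sketch of the whole flow. The data is made up for illustration: the column names id_col, col_a, col_b and target_col follow the question, while num_col is a hypothetical numeric feature standing in for the "other non-categorical variables" mentioned there. Note that in pandas 2.0 and later get_dummies returns boolean columns by default, so dtype=int is passed here to get the 0/1 output shown above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data shaped like the question: an ID, two categorical
# columns, one numeric column (num_col is an assumption), and a target.
df = pd.DataFrame({
    "id_col": [1, 2, 3, 4, 5, 6],
    "col_a": ["x", "y", "x", "z", "y", "z"],
    "col_b": ["u", "u", "v", "v", "u", "v"],
    "num_col": [0.5, 1.2, 3.1, 0.7, 2.2, 1.9],
    "target_col": [0, 1, 0, 1, 1, 0],
})

# Keep the ID and the target out of the feature matrix.
features = df.drop(columns=["id_col", "target_col"])

# Only the listed columns are one-hot encoded; num_col is left unchanged.
# dtype=int gives 0/1 columns instead of booleans.
X = pd.get_dummies(features, columns=["col_a", "col_b"], dtype=int)
y = df["target_col"]

# The encoded DataFrame can be passed to the classifier as-is.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
predictions = clf.predict(X)
```

One thing to watch with this approach: get_dummies only creates columns for the categories it sees, so new data encoded separately may end up with different columns. Something like new_X.reindex(columns=X.columns, fill_value=0) can be used to line it up with the training columns before calling predict.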