python, pandas, scikit-learn, sklearn-pandas

How to rank the categorical values while one-hot-encoding


My data looks like this:

id  feature_1  feature_2
1   a          e
2   b          c
3   c          d
4   d          b
5   e          a

I want a one-hot-style encoding in which the category from the first column (feature_1) is encoded as 1 and the category from the second column (feature_2) as 0.5, as in the following table:

id  a    b    c    d    e
1   1    0    0    0    0.5
2   0    1    0.5  0    0
3   0    0    1    0.5  0
4   0    0.5  0    1    0
5   0.5  0    0    0    1

But when I apply sklearn.preprocessing.OneHotEncoder, it outputs 10 columns (one per category of each feature), each filled with 0s and 1s.

How can I achieve this?


Solution

  • For two columns, you can add the two cross-tabulations, weighting the second one by 0.5:

    pd.crosstab(df['id'], df['feature_1']) + pd.crosstab(df['id'], df['feature_2']) * 0.5
    

    Output:

    feature_1    a    b    c    d    e
    id                                
    1          1.0  0.0  0.0  0.0  0.5
    2          0.0  1.0  0.5  0.0  0.0
    3          0.0  0.0  1.0  0.5  0.0
    4          0.0  0.5  0.0  1.0  0.0
    5          0.5  0.0  0.0  0.0  1.0
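
To make the two-column case reproducible, here is a minimal sketch that rebuilds the sample data from the question and applies the same crosstab idea. The names `primary`, `secondary` and `ranked` are only illustrative. It uses `DataFrame.add(..., fill_value=0)` instead of `+`, on the assumption that the two features might not always share the same set of categories (with plain `+`, a category present in only one feature would produce NaN).

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "feature_1": list("abcde"),
    "feature_2": list("ecdba"),
})

# Each crosstab is a 0/1 indicator table: one row per id, one column per category.
primary = pd.crosstab(df["id"], df["feature_1"])
secondary = pd.crosstab(df["id"], df["feature_2"])

# Weight feature_1 by 1 and feature_2 by 0.5, then add the two tables.
# fill_value=0 treats a category missing from one table as 0 rather than NaN.
ranked = primary.add(secondary * 0.5, fill_value=0)
print(ranked)
```

For the sample data both features use the categories a to e, so this gives the same values as the one-liner above.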
    

    If you have more than two features, each with a defined weight, you can melt the frame to long form, map each feature name to its weight, and sum the weights per id and value:

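    weights = {'feature_1': 1, 'feature_2': 0.5}
    # Long form: one row per (id, feature) pair, with columns id, variable, value
    flatten = df.melt('id')
    
    # Map each feature name to its weight, then sum the weights per (id, category)
    (flatten['variable'].map(weights)
         .groupby([flatten['id'], flatten['value']])
         .sum().unstack('value', fill_value=0)
    )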
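
As a self-contained sketch of the melt-and-map approach with more than two features: the third column `feature_3` and its weight of 0.25 are hypothetical and added only for illustration, as are the names `long_df` and `encoded`. The sample values for `feature_1` and `feature_2` are the ones from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "feature_1": list("abcde"),
    "feature_2": list("ecdba"),
    "feature_3": list("bdeac"),  # hypothetical third feature, not in the question
})

# Weight per feature column; the 0.25 for feature_3 is an assumed value.
weights = {"feature_1": 1.0, "feature_2": 0.5, "feature_3": 0.25}

# Long form: one row per (id, feature) pair with columns id, variable, value.
long_df = df.melt(id_vars="id", value_vars=list(weights))
long_df["weight"] = long_df["variable"].map(weights)

# Sum the weights per (id, category) and spread the categories into columns.
encoded = (
    long_df.groupby(["id", "value"])["weight"]
           .sum()
           .unstack("value", fill_value=0)
)
print(encoded)
```

If the same category appears in several features for one id, its weights are summed; whether that is the desired behaviour depends on how the ranking is meant to be interpreted.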