Tags: python, pandas, scikit-learn, one-hot-encoding

How to apply a one-hot encoder to DataFrame columns that contain lists?


Suppose that we have this data frame:

ID CATEGORIES
0 ['A']
1 ['A', 'C']
2 ['B', 'C']

I want to one-hot encode the CATEGORIES column. The result I want is:

ID A B C
0 1 0 0
1 1 0 1
2 0 1 1

I know this can easily be coded by hand. I just want to know whether it is already implemented in some package, since coding it in pure Python will probably result in a fairly slow function.

(I had to put the tables in code blocks because Stack Overflow would not let me post them as tables.)


Solution

  • You can use str.join combined with str.get_dummies:

    out = df[['ID']].join(df['CATEGORIES'].str.join('|').str.get_dummies())
    

    Output:

       ID  A  B  C
    0   0  1  0  0
    1   1  1  0  1
    2   2  0  1  1
    

    Input used:

    import pandas as pd

    df = pd.DataFrame({'ID': [0, 1, 2],
                       'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})
    

    There are many other alternatives, using pivot, crosstab, etc.

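    One example, using explode and crosstab (crosstab counts occurrences, so the result is strictly 0/1 only when no list repeats a category):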

    df2 = df.explode('CATEGORIES')
    
    out = pd.crosstab(df2['ID'], df2['CATEGORIES']).reset_index()
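
Since the question is also tagged scikit-learn and asks whether a package already implements this, here is a minimal sketch using scikit-learn's MultiLabelBinarizer, which is designed for list-valued labels. It assumes scikit-learn is installed; the names mlb, encoded and out_mlb are illustrative and not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same input as in the answer above
df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})

# Fit on the lists and build one 0/1 indicator column per distinct label
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['CATEGORIES']),
                       columns=mlb.classes_,
                       index=df.index)

# Attach the indicator columns to the ID column
out_mlb = df[['ID']].join(encoded)
print(out_mlb)
```

Unlike the str.join approach, this does not depend on a separator character, so it should also work when a label itself contains '|'. The fitted mlb can be reused with mlb.transform on new data that has the same set of labels.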