python pandas scikit-learn one-hot-encoding

How to apply a one-hot encoder to dataframe columns that contain lists?

Suppose that we have this data frame:

ID	CATEGORIES
0	['A']
1	['A', 'C']
2	['B', 'C']

I want to one-hot encode the CATEGORIES column. The result I want is:

ID	A	B	C
0	1	0	0
1	1	0	1
2	0	1	1

I know this can easily be coded by hand. I just want to know whether it is already implemented in some package, since coding it in pure Python will probably result in quite a slow function.


Solution

You can use str.join combined with str.get_dummies:

out = df[['ID']].join(df['CATEGORIES'].str.join('|').str.get_dummies())

Output:

   ID  A  B  C
0   0  1  0  0
1   1  1  0  1
2   2  0  1  1
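
To see why this works, the chain can be broken into its steps. The sketch below rebuilds the input used in this answer; the intermediate names `joined` and `dummies` are only illustrative. It assumes that no category label contains the `|` character, since that is the default separator of str.get_dummies.

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})

# Step 1: turn each list into one delimited string: 'A', 'A|C', 'B|C'
joined = df['CATEGORIES'].str.join('|')

# Step 2: split on '|' (the default separator) and build
# one 0/1 indicator column per distinct label
dummies = joined.str.get_dummies()

# Step 3: put the ID column back in front of the indicator columns
out = df[['ID']].join(dummies)
```

If a label might contain `|`, a different separator can be passed to both calls, for example str.join(';') together with str.get_dummies(sep=';').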

Input used:

import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})

There are many other alternatives, using pivot, crosstab, etc.

One example:

# one row per (ID, category) pair
df2 = df.explode('CATEGORIES')

# count how often each category occurs for each ID
out = pd.crosstab(df2['ID'], df2['CATEGORIES']).reset_index()
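
For comparison with the first approach, here is a self-contained sketch of the crosstab variant. Two details are assumptions added here and are not part of the snippet above: ignore_index=True gives the exploded frame a fresh index, and clip(upper=1) caps the counts at 1, which only matters if a list repeats a category. The name `counts` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})

# One row per (ID, category) pair; ignore_index avoids duplicated index labels
df2 = df.explode('CATEGORIES', ignore_index=True)

# crosstab counts how often each category occurs for each ID
counts = pd.crosstab(df2['ID'], df2['CATEGORIES'])

# Assumption: cap counts at 1 so repeated labels still give 0/1 indicators
out = counts.clip(upper=1).reset_index()

# Drop the leftover 'CATEGORIES' label on the column axis
out.columns.name = None
```

Unlike the str.get_dummies approach, an ID whose list is empty is likely to be missing from this result, because explode turns an empty list into NaN and crosstab leaves missing values out by default.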