Suppose that we have this data frame:
ID | CATEGORIES |
---|---|
0 | ['A'] |
1 | ['A', 'C'] |
2 | ['B', 'C'] |
I want to one-hot encode the CATEGORIES column. The result I want is:
ID | A | B | C |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 1 | 0 | 1 |
2 | 0 | 1 | 1 |
I know this can easily be coded by hand. I just want to know whether it is already implemented in some package, since coding it in pure Python will probably be quite slow.
You can use str.join combined with str.get_dummies:
out = df[['ID']].join(df['CATEGORIES'].str.join('|').str.get_dummies())
Output:
ID A B C
0 0 1 0 0
1 1 1 0 1
2 2 0 1 1
Input used:
import pandas as pd
df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})
There are many other alternatives, using pivot, crosstab, etc. One example:
df2 = df.explode('CATEGORIES')
out = pd.crosstab(df2['ID'], df2['CATEGORIES']).reset_index()
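Another variant along the same lines, as a minimal sketch rather than a definitive solution: explode the lists, dummy-encode the exploded column with pd.get_dummies, then collapse back to one row per original index with a groupby. It assumes the same df as above and that the index of df is unique (the groupby relies on explode repeating the original index labels). Unlike the str.join approach, it does not depend on a separator character that must be absent from the category names.

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})

# One row per (ID, category) pair; explode repeats the original index labels
exploded = df.explode('CATEGORIES')

# Dummy-encode each exploded row, then take the max per original index label
# so every category present in a list ends up as 1 in that row
dummies = (pd.get_dummies(exploded['CATEGORIES'], dtype=int)
             .groupby(level=0)
             .max())

# Re-attach the ID column (aligned on the index)
out = df[['ID']].join(dummies)
print(out)
```

Since both frames share the original index, the final join lines the indicator columns up with the matching ID.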