I have a pandas dataframe:
import pandas as pd
d={'col1':[[1,2,3],[4,5,6]],'col2':[[7,8,9],[10,11,12]]}
df=pd.DataFrame(d)
which results in:
        col1          col2
0  [1, 2, 3]     [7, 8, 9]
1  [4, 5, 6]  [10, 11, 12]
However, I want to implement a OneHotEncoder, which will treat each list within the cells of the DataFrame as a single string, whereas I want it to treat each value independently.
How would I implement this? My actual DataFrame contains lists of 500 items and has 4000 unique values.
I think you can use stack to create a Series, then cast each list to a string with astype, remove the [] with str.strip, and finally call str.get_dummies:
df = df.stack().astype(str).str.strip('[]').str.get_dummies(sep=', ')
print (df)
        1  10  11  12  2  3  4  5  6  7  8  9
0 col1  1   0   0   0  1  1  0  0  0  0  0  0
  col2  0   0   0   0  0  0  0  0  0  1  1  1
1 col1  0   0   0   0  0  0  1  1  1  0  0  0
  col2  0   1   1   1  0  0  0  0  0  0  0  0
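Note that the column labels above are strings, so they come out in lexicographic order (1, 10, 11, 12, 2, ...). If numeric order matters, one option is to convert the labels to integers and sort them. This is only a sketch: the name out is mine, and it assumes every list element is an integer.

```python
import pandas as pd

d = {'col1': [[1, 2, 3], [4, 5, 6]], 'col2': [[7, 8, 9], [10, 11, 12]]}
df = pd.DataFrame(d)

# Same chain as above, kept in a separate variable so df stays intact.
out = df.stack().astype(str).str.strip('[]').str.get_dummies(sep=', ')

# The labels are strings such as '10'; convert them to integers
# (assumes all list elements are integers) and sort numerically.
out.columns = out.columns.astype(int)
out = out.sort_index(axis=1)
print(out)
```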
For one column only (starting again from the original df):
df = df['col1'].astype(str).str.strip('[]').str.get_dummies(sep=', ')
print (df)
   1  2  3  4  5  6
0  1  1  1  0  0  0
1  0  0  0  1  1  1
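As an alternative to the string round trip, here is a rough sketch that uses Series.explode (available since pandas 0.25) instead of astype(str) and str.strip. This is a different technique from the one above, not a drop-in equivalent: it keeps the original values as column labels rather than strings. The names stacked, exploded and dummies are mine. Be aware that the intermediate frame has one row per list element, so it may need a lot of memory for long lists.

```python
import pandas as pd

d = {'col1': [[1, 2, 3], [4, 5, 6]], 'col2': [[7, 8, 9], [10, 11, 12]]}
df = pd.DataFrame(d)

# One entry per (row, column) pair, each still holding a list.
stacked = df.stack()

# One entry per list element; the (row, column) index labels repeat.
exploded = stacked.explode()

# One indicator column per distinct value, then collapse the repeated
# index labels back to a single row per (row, column) pair.
dummies = (
    pd.get_dummies(exploded, dtype=int)
    .groupby(level=[0, 1])
    .max()
)
print(dummies)
```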