list pandas dataframe multiple-columns one-hot-encoding

One-hot encoding and lists in a pandas DataFrame


I have a pandas dataframe:

import pandas as pd    
d={'col1':[[1,2,3],[4,5,6]],'col2':[[7,8,9],[10,11,12]]}
df=pd.DataFrame(d)

which results in:

        col1          col2
0  [1, 2, 3]     [7, 8, 9]
1  [4, 5, 6]  [10, 11, 12]

However, I want to apply a one-hot encoder to it. The encoder treats each list in the cells of the DataFrame as a single string, whereas I want it to treat each value in the list independently.

How would I implement this? My actual DataFrame contains lists of 500 items and has 4000 unique values.
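
To make the problem concrete, here is a minimal sketch (not part of the original question; the name whole_list is only illustrative). It assumes the lists are cast to strings before encoding, in which case each distinct list becomes one column instead of each value:

```python
import pandas as pd

d = {'col1': [[1, 2, 3], [4, 5, 6]], 'col2': [[7, 8, 9], [10, 11, 12]]}
df = pd.DataFrame(d)

# Casting each list to a string and encoding it yields one column per
# distinct list, not one column per value inside the list.
whole_list = pd.get_dummies(df['col1'].astype(str))
print(whole_list.columns.tolist())  # ['[1, 2, 3]', '[4, 5, 6]']
```

The desired result would instead have one column for each of the values 1 to 6.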


Solution

  • I think you can use stack to reshape the DataFrame into a Series, cast each list to a string with astype, remove the surrounding [] with str.strip, and finally call str.get_dummies:

    # Assign to a new name so the original df stays available
    encoded = df.stack().astype(str).str.strip('[]').str.get_dummies(sep=', ')
    print(encoded)
            1  10  11  12  2  3  4  5  6  7  8  9
    0 col1  1   0   0   0  1  1  0  0  0  0  0  0
      col2  0   0   0   0  0  0  0  0  0  1  1  1
    1 col1  0   0   0   0  0  0  1  1  1  0  0  0
      col2  0   1   1   1  0  0  0  0  0  0  0  0
    

    One column only:

    # Assign to a new name so the original df stays available
    encoded_col1 = df['col1'].astype(str).str.strip('[]').str.get_dummies(sep=', ')
    print(encoded_col1)
       1  2  3  4  5  6
    0  1  1  1  0  0  0
    1  0  0  0  1  1  1
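
    A possible alternative, sketched here under the assumption that Series.explode is available (pandas 0.25 or later), avoids the round trip through strings. Each list element gets its own row, the elements are one-hot encoded, and the rows are collapsed back per original index. The names exploded, encoded_col1 and encoded_all are only illustrative, and this has not been benchmarked against the 500-item, 4000-value case from the question:

```python
import pandas as pd

d = {'col1': [[1, 2, 3], [4, 5, 6]], 'col2': [[7, 8, 9], [10, 11, 12]]}
df = pd.DataFrame(d)

# One column: explode gives one row per list element and repeats the
# original row index for every element of that row's list.
exploded = df['col1'].explode()

# One-hot encode each element, then take the max per original row so that
# a value present anywhere in the list is marked with 1.
encoded_col1 = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(encoded_col1)

# All columns: stack first, so the index becomes (row, column name),
# then explode and collapse on both index levels.
stacked = df.stack().explode()
encoded_all = pd.get_dummies(stacked).groupby(level=[0, 1]).max().astype(int)
print(encoded_all)
```

    With this variant the column labels stay integers, so they are ordered numerically (1, 2, 3, ...) rather than as strings (1, 10, 11, ...).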