Search code examples
pythonpython-3.xpandaslistone-hot-encoding

Encode multiple label in DataFrame


Given a list of list, in which each sublist is a bucket filled with letters, like:

L=[['a','c'],['b','e'],['d']]

I would like to encode each sublist as one row in my DataFrame like this:

    a   b   c   d   e
0   1   0   1   0   0
1   0   1   0   0   1
2   0   0   0   1   0

Let's assume the letter is just from 'a' to 'e'. I am wondering how to complete a function to do so.


Solution

  • You can use the sklearn library:

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer
    
    L = [['a', 'c'], ['b', 'e'], ['d']]
    
    mlb = MultiLabelBinarizer()
    
    res = pd.DataFrame(mlb.fit_transform(L),
                       columns=mlb.classes_)
    
    print(res)
    
       a  b  c  d  e
    0  1  0  1  0  0
    1  0  1  0  0  1
    2  0  0  0  1  0