Search code examples
pythonpandasnumpyscikit-learnsklearn-pandas

How to one-hot-encode from a pandas column containing a list?


I would like to break down a pandas column consisting of a list of elements into as many columns as there are unique elements i.e. one-hot-encode them (with value 1 representing a given element existing in a row and 0 in the case of absence).

For example, taking dataframe df

Col1   Col2         Col3
 C      33     [Apple, Orange, Banana]
 A      2.5    [Apple, Grape]
 B      42     [Banana] 

I would like to convert this to:

df

Col1   Col2   Apple   Orange   Banana   Grape
 C      33     1        1        1       0
 A      2.5    1        0        0       1
 B      42     0        0        1       0

How can I use pandas/sklearn to achieve this?


Solution

  • We can also use sklearn.preprocessing.MultiLabelBinarizer:

    Often we want to use sparse DataFrame for the real world data in order to save a lot of RAM.

    Sparse solution (for Pandas v0.25.0+)

    from sklearn.preprocessing import MultiLabelBinarizer
    
    mlb = MultiLabelBinarizer(sparse_output=True)
    
    df = df.join(
                pd.DataFrame.sparse.from_spmatrix(
                    mlb.fit_transform(df.pop('Col3')),
                    index=df.index,
                    columns=mlb.classes_))
    

    result:

    In [38]: df
    Out[38]:
      Col1  Col2  Apple  Banana  Grape  Orange
    0    C  33.0      1       1      0       1
    1    A   2.5      1       0      1       0
    2    B  42.0      0       1      0       0
    
    In [39]: df.dtypes
    Out[39]:
    Col1                object
    Col2               float64
    Apple     Sparse[int32, 0]
    Banana    Sparse[int32, 0]
    Grape     Sparse[int32, 0]
    Orange    Sparse[int32, 0]
    dtype: object
    
    In [40]: df.memory_usage()
    Out[40]:
    Index     128
    Col1       24
    Col2       24
    Apple      16    #  <--- NOTE!
    Banana     16    #  <--- NOTE!
    Grape       8    #  <--- NOTE!
    Orange      8    #  <--- NOTE!
    dtype: int64
    

    Dense solution

    mlb = MultiLabelBinarizer()
    df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),
                              columns=mlb.classes_,
                              index=df.index))
    

    Result:

    In [77]: df
    Out[77]:
      Col1  Col2  Apple  Banana  Grape  Orange
    0    C  33.0      1       1      0       1
    1    A   2.5      1       0      1       0
    2    B  42.0      0       1      0       0