Converting a Pandas Dataframe column into one hot labels


I have a pandas dataframe similar to this:

  Col1   ABC
0  XYZ    A
1  XYZ    B
2  XYZ    C

By using the pandas get_dummies() function on column ABC, I can get this:

  Col1   A   B   C
0  XYZ   1   0   0
1  XYZ   0   1   0
2  XYZ   0   0   1
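
To make the setup reproducible, here is a minimal sketch of how the frame and the wide output above could be built. The variable name `wide` and the `prefix`, `prefix_sep`, and `dtype` arguments are assumptions added for illustration; they are not taken from the original post. They keep the column names as plain A, B, C and force 0/1 values, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({"Col1": ["XYZ", "XYZ", "XYZ"], "ABC": ["A", "B", "C"]})

# prefix="" and prefix_sep="" keep the new column names as plain A, B, C
# (assumption: chosen here only to match the table shown above).
# dtype=int forces 0/1, because newer pandas versions default to booleans.
wide = pd.get_dummies(df, columns=["ABC"], prefix="", prefix_sep="", dtype=int)
print(wide)
```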

What I need instead is something like this, where the ABC column has a list / array datatype:

  Col1    ABC
0  XYZ    [1,0,0]
1  XYZ    [0,1,0]
2  XYZ    [0,0,1]

I tried using the get_dummies function and then combining the resulting columns into the single column I wanted. I found a lot of answers explaining how to combine multiple columns as strings, such as "Combine two columns of text in dataframe in pandas/python", but I cannot figure out a way to combine them as a list.

The question "How do I one-hot encode one column of a pandas dataframe?" introduced the idea of using sklearn's OneHotEncoder, but I couldn't get it to work.

One more thing: all the answers I came across required the column names to be typed manually when combining them. Is there a way to use DataFrame.iloc or a slicing mechanism to combine columns into a list?


Solution

  • Here is an example of using sklearn.preprocessing.LabelBinarizer:

    In [361]: from sklearn.preprocessing import LabelBinarizer
    
    In [362]: lb = LabelBinarizer()
    
    In [363]: df['new'] = lb.fit_transform(df['ABC']).tolist()
    
    In [364]: df
    Out[364]:
      Col1 ABC        new
    0  XYZ   A  [1, 0, 0]
    1  XYZ   B  [0, 1, 0]
    2  XYZ   C  [0, 0, 1]
    

    Pandas alternative:

    In [370]: df['new'] = df['ABC'].str.get_dummies().values.tolist()
    
    In [371]: df
    Out[371]:
      Col1 ABC        new
    0  XYZ   A  [1, 0, 0]
    1  XYZ   B  [0, 1, 0]
    2  XYZ   C  [0, 0, 1]
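
The question also asked whether the columns can be combined without typing their names. A minimal self-contained sketch of one way to do that follows, assuming the indicator columns sit next to each other so they can be selected by position with `iloc`. The names `dummies`, `wide`, `combined`, and `ABC_onehot` are hypothetical and used only for illustration.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({"Col1": ["XYZ", "XYZ", "XYZ"], "ABC": ["A", "B", "C"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df["ABC"], dtype=int)

# Collapse every indicator column into one list per row, no names typed.
df["ABC_onehot"] = dummies.to_numpy().tolist()

# Positional variant: if the indicators already sit in a wider frame,
# slice them by position with iloc (here: every column after the first).
wide = pd.concat([df[["Col1"]], dummies], axis=1)
wide["ABC"] = wide.iloc[:, 1:].to_numpy().tolist()
combined = wide[["Col1", "ABC"]]
print(combined)
```

The slice `1:` is specific to this layout; with other columns in the frame, the start and stop positions would need to match wherever the indicator columns actually are.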