python, pandas, one-hot-encoding

Convert one-hot columns to a new column of string lists


Let's suppose I have this dataframe:

        a   b   c   d
i    
0       1   0   0   0   
2       0   0   0   0
4       0   1   1   0

I want to add a new column 'class' that records, for each row, which classes it belongs to, as a list of strings:

        a   b   c   d   class
i    
0       1   0   0   0   ['a']
2       0   0   0   0    NaN
4       0   1   1   0   ['b','c']

How can I do this in a way that is robust (handles NaN and rows with multiple classes) and efficient?

For now, I cast each column to bool and multiply it by its column name inside an apply function, but this doesn't handle multiple classes or NaN well, and it's clearly not optimal.

Thanks for your help!


Solution

  • You can use numpy.where to get the row and column positions of every nonzero entry (here, every 1). The column positions give the class labels, and the row positions tell you which row each label belongs to. This code worked for me:

    import numpy as np
    import pandas as pd

    # Allocate the output first so that rows without any label stay NaN
    out = pd.Series(np.nan, index=df.index, dtype=object)

    for i, j in zip(*np.where(df)):
        i = df.index[i]             # Convert the integer position to the dataframe index label
        label = df.columns[j]       # Class label for this column position

        # pd.isnull() on a list returns an array, which is ambiguous in an `if`
        # once the list holds two or more labels, so test for the list itself.
        if isinstance(out[i], list):
            out[i].append(label)    # A list already exists for this row: append to it
        else:
            out[i] = [label]        # First label for this row: start a new list

    df["class"] = out
    
    print(df)
       a  b  c  d   class
    i                    
    0  1  0  0  0     [a]
    2  0  0  0  0     NaN
    4  0  1  1  0  [b, c]
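
For comparison, here is a more compact sketch of the same idea that builds each row's list from a boolean mask instead of filling the output cell by cell. It is only one possible variant, not the method from the answer above: the helper name `one_hot_to_labels` is made up for illustration, the frame is rebuilt from the question so the snippet runs on its own, and it assumes a label applies only where the value equals 1 (so NaN and 0 are both treated as "no label").

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    {"a": [1, 0, 0], "b": [0, 0, 1], "c": [0, 0, 1], "d": [0, 0, 0]},
    index=pd.Index([0, 2, 4], name="i"),
)


def one_hot_to_labels(frame):
    """Hypothetical helper: per row, the list of columns equal to 1, or NaN if none."""
    mask = frame.to_numpy() == 1            # NaN and 0 both compare False here
    columns = frame.columns.to_numpy()
    # Boolean-index the column names with each row of the mask.
    values = [list(columns[row]) if row.any() else np.nan for row in mask]
    return pd.Series(values, index=frame.index, dtype=object)


df["class"] = one_hot_to_labels(df)
print(df)
```

On the example frame this gives ['a'] for row 0, NaN for row 2 and ['b', 'c'] for row 4. The loop over rows is still plain Python, so this is mainly shorter rather than fundamentally faster.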