Convert one-hot columns to a new column of string lists

Let's suppose I have this dataframe:

        a   b   c   d
i    
0       1   0   0   0   
2       0   0   0   0
4       0   1   1   0
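
For anyone who wants to reproduce this, here is a minimal sketch of how the frame above could be built. The variable name `df` and the integer dtype are assumptions; the index is named `i` to match the printout.

```python
import pandas as pd

# Example frame from the question: four one-hot columns, index labels 0, 2, 4
df = pd.DataFrame(
    {
        "a": [1, 0, 0],
        "b": [0, 0, 1],
        "c": [0, 0, 1],
        "d": [0, 0, 0],
    },
    index=pd.Index([0, 2, 4], name="i"),
)

print(df)
```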

I want to add a new column 'class' that records, for each row, which classes it belongs to as a list of strings:

        a   b   c   d   class
i    
0       1   0   0   0   ['a']
2       0   0   0   0    NaN
4       0   1   1   0   ['b','c']

How can I do this in a way that is robust (handles NaN and multiple classes) and efficient?

So far, I have cast each column to bool and multiplied it by its column name inside an apply function, but this handles neither multiple classes nor NaN well, and it is clearly not optimal.

Thanks for your help!

Solution

You can use numpy.where to get the row and column positions of every nonzero entry. Each column position identifies a class label, and the matching row position tells you which row that label belongs to. This code worked for me:

import numpy as np
import pandas as pd

# Allocate the output first so that rows with no labels stay NaN
out = pd.Series(np.nan, index=df.index, dtype=object)

# np.where(df) yields the (row, column) positions of every nonzero cell
for i, j in zip(*np.where(df)):
    i = df.index[i]             # Extract dataframe index label instead of integer position
    label = df.columns[j]       # Extract relevant class label

    if isinstance(out[i], list):    # A list already exists for this row, so append to it
        out[i].append(label)
    else:                           # Still NaN: start a new list with this class label
        out[i] = [label]

df["class"] = out

print(df)
   a  b  c  d   class
i                    
0  1  0  0  0     [a]
2  0  0  0  0     NaN
4  0  1  1  0  [b, c]
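
To make the numpy.where step concrete, here is a self-contained sketch that first shows the positions it returns for the example frame and then collects the labels with a groupby instead of an explicit loop. This is a variant of the approach above, not the same code: the helper name `one_hot_to_labels` is hypothetical, and the sketch assumes the index labels are unique and that membership is marked by the value 1.

```python
import numpy as np
import pandas as pd

# Example frame from the question (rebuilt here so the snippet runs on its own)
df = pd.DataFrame(
    {"a": [1, 0, 0], "b": [0, 0, 1], "c": [0, 0, 1], "d": [0, 0, 0]},
    index=pd.Index([0, 2, 4], name="i"),
)

# Integer positions of every cell equal to 1, in row-major order
rows, cols = np.where(df.to_numpy() == 1)
print(rows)  # row positions:    [0 2 2]
print(cols)  # column positions: [0 1 2]


def one_hot_to_labels(frame):
    """Return a Series of label lists; rows without any 1 become NaN.

    Hypothetical helper; assumes `frame.index` has no duplicate labels.
    """
    r, c = np.where(frame.to_numpy() == 1)
    # One entry per hit: the value is the column label, the index is the row label
    hits = pd.Series(frame.columns[c], index=frame.index[r])
    # Gather the labels of each row into a list
    grouped = hits.groupby(level=0).agg(list)
    # Reindex so rows with no hits reappear as NaN, in the original order
    return grouped.reindex(frame.index)


result = one_hot_to_labels(df)
df["class"] = result
print(df)
```

Because the helper is called before the new column is assigned, 'class' itself is never scanned for ones. If the frame already has extra columns, pass only the one-hot columns, for example `one_hot_to_labels(df[["a", "b", "c", "d"]])`.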