Let's suppose I have this dataframe:
   a  b  c  d
i
0  1  0  0  0
2  0  0  0  0
4  0  1  1  0
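For anyone who wants to reproduce this, here is a minimal sketch that builds the same frame. Only the values and the index name `i` come from the table above; the construction itself is just one assumed way to get there.

```python
import pandas as pd

# Values copied from the table above; the index is named "i"
df = pd.DataFrame(
    {"a": [1, 0, 0], "b": [0, 0, 1], "c": [0, 0, 1], "d": [0, 0, 0]},
    index=pd.Index([0, 2, 4], name="i"),
)
print(df)
```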
I want to add a new column 'class' that summarizes, as a list of strings, which classes each row of the index belongs to:
   a  b  c  d      class
i
0  1  0  0  0      ['a']
2  0  0  0  0        NaN
4  0  1  1  0  ['b','c']
How can I do this in a way that is robust (handles NaN and rows with multiple classes) and efficient?
So far I have cast each column to bool and multiplied it by its column name inside an apply function, but that doesn't handle multiple classes or NaN well, and it's clearly not optimal.
Thanks for your help!
You can use numpy.where to get the indices of wherever 1 occurs. From there, the column indices give you the class labels and the row indices are used for alignment. This code worked for me:
import numpy as np
import pandas as pd

# Allocate the output first so rows that have no labels stay NaN
out = pd.Series(np.nan, index=df.index, dtype=object)
for i, j in zip(*np.where(df.to_numpy() == 1)):
    idx = df.index[i]       # Dataframe index label instead of integer position
    label = df.columns[j]   # Relevant class label
    # isinstance is used because pd.isnull on a list returns an array,
    # which is ambiguous in an `if` once a row has two or more labels
    if isinstance(out[idx], list):
        out[idx].append(label)  # A list already exists for this row, so append to it
    else:
        out[idx] = [label]      # First label for this row: start a new list
df["class"] = out
print(df)
   a  b  c  d   class
i
0  1  0  0  0     [a]
2  0  0  0  0     NaN
4  0  1  1  0  [b, c]
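If you would rather not mutate lists inside a loop, here is a sketch of an alternative that builds each row's list in one pass over a boolean mask. It assumes the frame holds only 0/1 values in the class columns; the names `mask`, `cols`, `labels` and `result` are my own, and I have not benchmarked it against the loop above.

```python
import numpy as np
import pandas as pd

# Same example frame as in the question
df = pd.DataFrame(
    {"a": [1, 0, 0], "b": [0, 0, 1], "c": [0, 0, 1], "d": [0, 0, 0]},
    index=pd.Index([0, 2, 4], name="i"),
)

mask = df.to_numpy() == 1        # True wherever a row belongs to a class
cols = df.columns.to_numpy()     # Column names as an array, for boolean indexing

# One list of labels per row; rows with no label get NaN instead of an empty list
labels = [list(cols[row]) if row.any() else np.nan for row in mask]

# "class" is a Python keyword, so it has to be passed to assign via a dict
result = df.assign(**{"class": pd.Series(labels, index=df.index, dtype=object)})
print(result)
```

This still loops over rows in Python, but it selects all of a row's labels at once and leaves the original `df` untouched.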