I am trying to get dummies for a variable for which data is split into multiple columns.
Input Data:
fruit_1 fruit_2 fruit_3 fruit_4 fruit_5
Index
person1 Apple NaN NaN NaN NaN
person2 Apple Banana Guava NaN NaN
person3 Guava NaN NaN NaN NaN
person4 Banana NaN NaN NaN NaN
person5 Apple Banana Guava Kiwi Mango
person6 Kiwi Mango NaN NaN NaN
Desired Output:
Apple Banana Guava Kiwi Mango
Index
person1 1 0 0 0 0
person2 1 1 1 0 0
person3 0 0 1 0 0
person4 0 1 0 0 0
person5 1 1 1 1 1
person6 0 0 0 1 1
In most of the approaches, I have tried the NaN/blank is causing issues as the number of values in each row can be anything from 1 to 5. I am using pandas to do so. Thank you for your help.
Use get_dummies
by all columns with aggregate max
by duplicated columns names:
df = pd.get_dummies(df, prefix='', prefix_sep='').groupby(level=0, axis=1).max()
print (df)
Apple Banana Guava Kiwi Mango
person1 1 0 0 0 0
person2 1 1 1 0 0
person3 0 0 1 0 0
person4 0 1 0 0 0
person5 1 1 1 1 1
person6 0 0 0 1 1
Or reshape first by DataFrame.stack
, then aggregate max
by index, first level:
df = pd.get_dummies(df.stack()).groupby(level=0).max()
print (df)
Apple Banana Guava Kiwi Mango
person1 1 0 0 0 0
person2 1 1 1 0 0
person3 0 0 1 0 0
person4 0 1 0 0 0
person5 1 1 1 1 1
person6 0 0 0 1 1