Search code examples
pandasdataframeone-hot-encoding

Getting dummies/encoding using multiple columns in pandas


I am trying to get dummies for a variable for which data is split into multiple columns.

Input Data:
       fruit_1  fruit_2 fruit_3 fruit_4 fruit_5
Index                   
person1 Apple   NaN     NaN     NaN     NaN
person2 Apple   Banana  Guava   NaN     NaN
person3 Guava   NaN     NaN     NaN     NaN
person4 Banana  NaN     NaN     NaN     NaN
person5 Apple   Banana  Guava   Kiwi    Mango
person6 Kiwi    Mango   NaN     NaN     NaN

Desired Output:

           Apple    Banana  Guava   Kiwi    Mango
Index                   
person1        1    0       0      0        0
person2        1    1       1      0        0
person3        0    0       1      0        0
person4        0    1       0      0        0
person5        1    1       1      1        1
person6        0    0       0      1        1

In most of the approaches, I have tried the NaN/blank is causing issues as the number of values in each row can be anything from 1 to 5. I am using pandas to do so. Thank you for your help.


Solution

  • Use get_dummies by all columns with aggregate max by duplicated columns names:

    df = pd.get_dummies(df, prefix='', prefix_sep='').groupby(level=0, axis=1).max()
    print (df)
             Apple  Banana  Guava  Kiwi  Mango
    person1      1       0      0     0      0
    person2      1       1      1     0      0
    person3      0       0      1     0      0
    person4      0       1      0     0      0
    person5      1       1      1     1      1
    person6      0       0      0     1      1
    

    Or reshape first by DataFrame.stack, then aggregate max by index, first level:

    df = pd.get_dummies(df.stack()).groupby(level=0).max()
    print (df)
             Apple  Banana  Guava  Kiwi  Mango
    person1      1       0      0     0      0
    person2      1       1      1     0      0
    person3      0       0      1     0      0
    person4      0       1      0     0      0
    person5      1       1      1     1      1
    person6      0       0      0     1      1