Tags: python, pandas, dataframe, numpy, one-hot-encoding

One Hot Encoding For Two Different Dataframe Columns


I have a dataframe with Toy as the id variable and the color schemes each toy comes in:

import pandas as pd

input_data = pd.DataFrame({'Toy': ['Toy1', 'Toy2', 'Toy3', 'Toy4'],
                           'Color1': ['Red', 'Orange', '', 'Orange'],
                           'Color2': ['Red', '', 'Blue', 'Red']})

I want to one-hot encode the Color1 and Color2 variables together, so that each color gets a single indicator column (named after the color, without any prefix) no matter which of the two columns it appears in:

output_data = pd.DataFrame({'Toy': ['Toy1', 'Toy2', 'Toy3', 'Toy4'],
                            'Red': [1, 0, 0, 1],
                            'Blue': [0, 0, 1, 0],
                            'Orange': [0, 1, 0, 1]})

This seems like it should be quick and easy, but I can't find a straightforward way to do it. Any leads are much appreciated.


Solution

  • Join the color values of each row with | and then use Series.str.get_dummies, which splits on | by default:

    df = input_data.set_index('Toy').agg('|'.join, axis=1).str.get_dummies().reset_index()
    print(df)
        Toy  Blue  Orange  Red
    0  Toy1     0       0    1
    1  Toy2     0       1    0
    2  Toy3     1       0    0
    3  Toy4     0       1    1
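
To see why the one-liner above works, it may help to split it into its steps. The following is only a sketch of the same approach: the intermediate names `joined` and `dummies` are made up for illustration, and the final column reordering is an addition to match the column order asked for in the question.

```python
import pandas as pd

input_data = pd.DataFrame({'Toy': ['Toy1', 'Toy2', 'Toy3', 'Toy4'],
                           'Color1': ['Red', 'Orange', '', 'Orange'],
                           'Color2': ['Red', '', 'Blue', 'Red']})

# Step 1: make Toy the index, then join the color values of each row
# into one '|'-separated string (Toy1 -> 'Red|Red', Toy2 -> 'Orange|').
joined = input_data.set_index('Toy').agg('|'.join, axis=1)

# Step 2: split each string on '|' and build one 0/1 column per distinct
# value. Empty strings get no column, and a color repeated within a row
# still yields a single 1.
dummies = joined.str.get_dummies()

# Step 3: turn Toy back into a column and pick the column order from the
# question (get_dummies itself returns the columns sorted alphabetically).
df = dummies.reset_index()[['Toy', 'Red', 'Blue', 'Orange']]
print(df)
```

Note that this relies on missing colors being empty strings, as in the question. If they were NaN instead, '|'.join would raise a TypeError, so calling .fillna('') on the color columns first would be one way to handle that case.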