Tags: python, pandas, dataframe, one-hot-encoding

JSON array to one hot encoding in pandas


Let's say I have a pandas dataframe that looks like the following:

car              colors
corvette         {"colors": ["red", "black"]}
forester         {"colors": ["white", "silver", "black"]}

I'd like to one-hot encode the colors of each car like so:

car        black    red   white  silver
corvette       1      1       0       0
forester       1      0       1       1

What's a nice elegant way to accomplish this?


Solution

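  • Try this (build a frame from the dicts, explode the color lists, one-hot encode, then aggregate back to one row per car):

    # Assumes `import pandas as pd` and that df.colors holds dicts, not raw JSON strings.
    (df.drop('colors', axis=1)
       .join(pd.get_dummies(pd.DataFrame.from_records(df.colors.values)
                      ['colors'].explode())
                # .sum(level=0) was removed in pandas 2.0; group on the index instead
                .groupby(level=0).sum()
            )
    )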
    

    Output:

            car  black  red  silver  white
    0  corvette      1    1       0      0
    1  forester      1    0       1      1
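
If the colors column holds raw JSON text rather than already-parsed dicts, the same idea can be sketched as below. This is a minimal, self-contained example under stated assumptions: the sample frame is rebuilt from the question, `one_hot_colors` is a hypothetical helper name (not a pandas function), and `max` is used instead of `sum` so that a color listed twice for one car still yields 1.

```python
import json

import pandas as pd

# Sample data from the question; assumption: each cell is a JSON string.
df = pd.DataFrame({
    "car": ["corvette", "forester"],
    "colors": [
        '{"colors": ["red", "black"]}',
        '{"colors": ["white", "silver", "black"]}',
    ],
})


def one_hot_colors(frame, column="colors", key="colors"):
    """Hypothetical helper: expand a JSON-array column into 0/1 columns."""
    # Parse JSON text when needed; already-parsed dicts pass through unchanged.
    parsed = frame[column].map(
        lambda v: json.loads(v) if isinstance(v, str) else v
    )
    # Pull out the list stored under `key`, then give each color its own row.
    exploded = parsed.map(lambda d: d[key]).explode()
    # One indicator column per color, collapsed back to one row per car.
    dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
    return frame.drop(columns=column).join(dummies)


result = one_hot_colors(df)
print(result)
```

For the sample data this should produce the same columns as the output above (car, black, red, silver, white). Because the join is on the index, the helper also works when the frame does not use the default 0..n-1 index.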