Search code examples
pythonpandasone-hot-encoding

How to one-hot encode a dataframe where each row has lists


I am trying to feed in data that has lists within lists of data to a machine learning algorithm:

for example a patient may have several medications and several responses to the medication they may also have names. So if they take more than 1 medication it will come as a list of 2 or more. They only have one name.

I believe one-hot encoding is the correct way to do so.

Here is what I have done so far:

I have a dataframe:

df = pandas.DataFrame([{'drug': ['drugA','drugB'], 'patient': 'john'}, {'drug': ['drugC','drugD'], 'patient': 'angel'}])

             drug patient
0  [drugA, drugB]    john
1  [drugC, drugD]   angel

I want to get something like:

  drugA  drugB drugC drugD patient
0  1       1     0     0     john
0  0       0     1     1     angel

I tried this:

pandas.get_dummies(df.apply(pandas.Series).stack()).sum(level=0)

But got:

TypeError: unhashable type: 'list'

Solution

  • Drawing heavily on this answer, here's one approach:

    df = pd.DataFrame([{'drug': ['drugA','drugB'], 'patient': 'john'}, 
                       {'drug': ['drugC','drugD'], 'patient': 'angel'}])
    s = df.drug
          .apply(lambda x: pd.Series(x))
          .unstack()
    df2 = df.join(pd.DataFrame(s.reset_index(level=0, drop=True)))
            .drop('drug',1)
            .rename(columns={0:'drug'})
    df2.merge(pd.get_dummies(df2.drug), left_index=True, right_index=True)
       .drop('drug',1)
    

    Output:

      patient  drugA  drugB  drugC  drugD
    0    john    1.0    0.0    0.0    0.0
    0    john    0.0    1.0    0.0    0.0
    0    john    1.0    0.0    0.0    0.0
    0    john    0.0    1.0    0.0    0.0
    1   angel    0.0    0.0    1.0    0.0
    1   angel    0.0    0.0    0.0    1.0
    1   angel    0.0    0.0    1.0    0.0
    1   angel    0.0    0.0    0.0    1.0