I am trying to feed in data that has lists within lists of data to a machine learning algorithm:
for example a patient may have several medications and several responses to the medication they may also have names. So if they take more than 1 medication it will come as a list of 2 or more. They only have one name.
I believe one-hot encoding is the correct way to do so.
Here is what I have done so far:
I have a dataframe:
df = pandas.DataFrame([{'drug': ['drugA','drugB'], 'patient': 'john'}, {'drug': ['drugC','drugD'], 'patient': 'angel'}])
drug patient
0 [drugA, drugB] john
1 [drugC, drugD] angel
I want to get something like:
drugA drugB drugC drugD patient
0 1 1 0 0 john
0 0 0 1 1 angel
I tried this:
pandas.get_dummies(df.apply(pandas.Series).stack()).sum(level=0)
But got:
TypeError: unhashable type: 'list'
Drawing heavily on this answer, here's one approach:
df = pd.DataFrame([{'drug': ['drugA','drugB'], 'patient': 'john'},
{'drug': ['drugC','drugD'], 'patient': 'angel'}])
s = df.drug
.apply(lambda x: pd.Series(x))
.unstack()
df2 = df.join(pd.DataFrame(s.reset_index(level=0, drop=True)))
.drop('drug',1)
.rename(columns={0:'drug'})
df2.merge(pd.get_dummies(df2.drug), left_index=True, right_index=True)
.drop('drug',1)
Output:
patient drugA drugB drugC drugD
0 john 1.0 0.0 0.0 0.0
0 john 0.0 1.0 0.0 0.0
0 john 1.0 0.0 0.0 0.0
0 john 0.0 1.0 0.0 0.0
1 angel 0.0 0.0 1.0 0.0
1 angel 0.0 0.0 0.0 1.0
1 angel 0.0 0.0 1.0 0.0
1 angel 0.0 0.0 0.0 1.0