My dataset is as follows:
Symptoms (X) :: Condition (Y)
fever, headache, blindness :: wagner syndrome
tooth pain,fever , sweet urine :: buri buri diseases
blindness,nose bleed,fever :: Taylor syndrome
where X are the features and Y are my labels. I would like to encode X into a one-hot matrix. pandas' get_dummies can't handle multiple values in one column, but if I split X into multiple columns, I lose the ability to encode the symptoms into the same one-hot matrix.
Any ideas?
You could do this with scikit-learn's CountVectorizer: each word becomes a column and each row is an observation. If you set the binary parameter to True, a word that is present in a row is represented as a 1 in that row and column. With binary set to False (the default), the value is instead the number of times that word appears in the sentence.
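Here is a minimal sketch of that idea, using the three rows from the question (with "blindness" spelled consistently, which is my assumption about the intended data). One caveat: by default CountVectorizer splits on words, so "tooth pain" would become two columns. To keep each symptom whole, the sketch passes a small helper, split_symptoms (a name made up for this example), as the tokenizer so the text is split on commas instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Symptom strings from the question (spelling of "blindness" normalized).
symptoms = [
    "fever, headache, blindness",
    "tooth pain,fever , sweet urine",
    "blindness,nose bleed,fever",
]


def split_symptoms(text):
    """Hypothetical helper: split on commas and trim surrounding whitespace."""
    return [part.strip() for part in text.split(",") if part.strip()]


# binary=True gives 1/0 presence flags instead of counts.
# token_pattern=None because the custom tokenizer replaces the default pattern.
vectorizer = CountVectorizer(
    tokenizer=split_symptoms,
    token_pattern=None,
    binary=True,
)
X = vectorizer.fit_transform(symptoms)  # sparse matrix, one row per observation

# Column names ordered by their column index in X.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(columns)
print(X.toarray())
```

The result is a sparse matrix; calling toarray() is fine for a small example, but for a large symptom vocabulary it is probably better to keep it sparse and pass it to the classifier directly.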