
One-hot encoding for a multi-level categorical dataset


My dataset is as follows:

Symptoms (X) :: Condition (Y)
fever, headache, blindnes :: wagner syndrom
tooth pain,fever , sweet urine :: buri buri diseases
blindness,nose bleed,fever :: Taylor syndrome

where X are the features and Y are my labels. I would like to encode X into a one-hot matrix. pandas' get_dummies can't handle multiple values in one column, but if I split X into multiple columns, I lose the ability to encode the symptoms into the same one-hot matrix.

Any ideas?


Solution

  • You could do this with scikit-learn's CountVectorizer: each row is an observation and each word is a column. If you set binary=True, a word that is present in a row is represented as a 1 in that row and column. With binary=False (the default), the value is instead the number of times the word occurs in that row.