python pandas machine-learning categorical-data

How to handle multi-select data for machine learning in Python with pandas


One of my features comes from a "select all that apply" question. This means each entry can have multiple values separated by commas, like:

[image: sample rows of the comma-separated feature column]

and so on. I need to convert this to numerical data so I can use it in my machine learning model, similar to what OneHotEncoder does. How do I handle this kind of data?

EDIT:

Here is what I imagine the results to look like

[image: desired output table]


Solution

  • Use Series.str.get_dummies to split each entry on the separator and build one indicator column per option, then DataFrame.add_prefix to get your desired column names:

    df['Feature'].str.get_dummies(sep=',').add_prefix('feature_')
    
       feature_option1  feature_option2  feature_option3  feature_option4
    0                1                0                1                0
    1                0                0                0                1
    2                0                1                1                0
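
For completeness, here is a minimal end-to-end sketch of the same approach. The sample values are hypothetical (the original data was only posted as an image) and were picked to reproduce the output above. The whitespace clean-up step is an extra precaution rather than part of the original answer: str.get_dummies does not strip spaces around the separator, so an entry like "option2, option3" would otherwise yield a separate " option3" column.

```python
import pandas as pd

# Hypothetical sample data, chosen to match the output shown above.
# The third row deliberately has a space after the comma.
df = pd.DataFrame({"Feature": ["option1,option3", "option4", "option2, option3"]})

# Normalise whitespace around the separators before splitting.
cleaned = df["Feature"].str.strip().str.replace(r"\s*,\s*", ",", regex=True)

# One 0/1 indicator column per distinct option, with a readable prefix.
dummies = cleaned.str.get_dummies(sep=",").add_prefix("feature_")

# Replace the original text column with the indicator columns.
encoded = df.drop(columns="Feature").join(dummies)
print(encoded)
```

If your real column uses a different delimiter (for example ";"), pass that as sep instead.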