Search code examples
pythonpandasencodingone-hot-encoding

Pandas separate list in row to columns and one hot encode


a client's online form outputs multi-choice selections to a single field separated by commas. is there a way to split out the values into columns and one hot encode them, I'm using pandas?

Data now:

id  | ind
1   | Student, Tourism, Industrial
2   | Residential, Student, Tourism
3   | student, Tourism, Industrial, Education

How I want the data:

id          | ind_student| ind_Tourism| ind_Industrial| ind_Residential|  ind_Education
1           | 1          | 1          | 1             | 0              | 0    
2           | 1          | 1          | 0             | 1              | 0
3           | 1          | 1          | 1             | 0              | 1

thanks for taking a look!


Solution

  • Split your ind columns into words and pivot your dataframe:

    out = df.assign(ind=df['ind'].str.split(', '), dummy=1).explode('ind') \
            .pivot_table('dummy', 'id', 'ind', fill_value=0).add_prefix('ind_')
    print(out)
    
    # Output:
    ind  ind_Education  ind_Industrial  ind_Residential  ind_Student  ind_Tourism  ind_student
    id                                                                                        
    1                0               1                0            1            1            0
    2                0               0                1            1            1            0
    3                1               1                0            0            1            1