python pandas data-analysis categorical-data

Split a DataFrame column with multiple categorical values into 0/1 label columns in Python


I have a column like this in a dataset.

print (pharma_data['Treated_with_drugs'].astype('category').cat.categories)
Index(['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ',
       'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ',
       'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ',
       'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ',
       'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ',
       'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ',
       'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ',
       'DX6'],
      dtype='object')

I want to split that column into six columns (DX1, DX2, DX3, DX4, DX5, DX6), each holding 0 or 1.

For example, if a row's value is 'DX1 DX2 DX5 ', then:

column names: DX1, DX2, DX3, DX4, DX5, DX6

column values: 1 1 0 0 1 0

How can I do that?


Solution

  • Use Series.str.strip to remove the trailing spaces, then Series.str.get_dummies with a space as the separator to build one 0/1 column per drug:

    import pandas as pd

    # Sample values copied from the question; most carry a trailing space.
    a = ['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ',
         'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ',
         'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ',
         'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ',
         'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ',
         'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ',
         'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ',
         'DX6']

    pharma_data = pd.DataFrame({'Treated_with_drugs': a})

    # Strip surrounding whitespace, then split on spaces into 0/1 indicator columns.
    df = pharma_data['Treated_with_drugs'].str.strip().str.get_dummies(' ')

    print(df)
        DX1  DX2  DX3  DX4  DX5  DX6
    0     1    0    0    0    0    0
    1     1    1    0    0    0    0
    2     1    1    1    0    0    0
    3     1    1    1    1    0    0
    4     1    1    1    1    1    0
    5     1    1    1    0    1    0
    6     1    1    0    1    0    0
    7     1    1    0    1    1    0
    8     1    1    0    0    1    0
    9     1    0    1    0    0    0
    10    1    0    1    1    0    0
    11    1    0    1    1    1    0
    12    1    0    1    0    1    0
    13    1    0    0    1    0    0
    14    1    0    0    1    1    0
    15    1    0    0    0    1    0
    16    0    1    0    0    0    0
    17    0    1    1    0    0    0
    18    0    1    1    1    0    0
    19    0    1    1    1    1    0
    20    0    1    1    0    1    0
    21    0    1    0    1    0    0
    22    0    1    0    1    1    0
    23    0    1    0    0    1    0
    24    0    0    1    0    0    0
    25    0    0    1    1    0    0
    26    0    0    1    1    1    0
    27    0    0    1    0    1    0
    28    0    0    0    1    0    0
    29    0    0    0    1    1    0
    30    0    0    0    0    1    0
    31    0    0    0    0    0    1
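
If the indicator columns need to sit next to the rest of the dataset, something like the following sketch may help. It is not part of the original answer: the `Patient_ID` column and the three sample rows are invented for illustration, and `expected_cols` is an assumed list of the six drug names from the question. The `reindex` step is a precaution so that all six columns exist even when a given sample never mentions one of the drugs.

```python
import pandas as pd

# Hypothetical miniature of the question's data; `Patient_ID` is an invented column.
pharma_data = pd.DataFrame({
    'Patient_ID': [101, 102, 103],
    'Treated_with_drugs': ['DX1 DX2 DX5 ', 'DX6', 'DX3 '],
})

# Assumed full set of drug labels, taken from the question.
expected_cols = ['DX1', 'DX2', 'DX3', 'DX4', 'DX5', 'DX6']

dummies = (
    pharma_data['Treated_with_drugs']
    .str.strip()
    .str.get_dummies(' ')
    # Add any drug column absent from this sample (here DX4), filled with 0.
    .reindex(columns=expected_cols, fill_value=0)
)

# Attach the indicator columns to the original rows; the indexes align.
encoded = pharma_data.join(dummies)

print(encoded)
```

Here the first row, 'DX1 DX2 DX5 ', is encoded as 1 1 0 0 1 0, matching the example in the question. Drop `Treated_with_drugs` from `encoded` afterwards if only the encoded columns are needed.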