I have a column like this in a dataset.
print(pharma_data['Treated_with_drugs'].astype('category').cat.categories)
Index(['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ',
'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ',
'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ',
'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ',
'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ',
'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ',
'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ',
'DX6'],
dtype='object')
I want to split that column into six columns, DX1, DX2, DX3, DX4, DX5 and DX6, with values of 0 or 1.
For example, if a row's value is 'DX1 DX2 DX5 ', then:
column names: DX1, DX2, DX3, DX4, DX5, DX6
column values: 1 1 0 0 1 0
How can I do that?
Use Series.str.strip to remove the surrounding whitespace, then Series.str.get_dummies with a space as the separator:
import pandas as pd

a = ['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ',
'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ',
'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ',
'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ',
'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ',
'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ',
'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ',
'DX6']
pharma_data = pd.DataFrame({'Treated_with_drugs': a})

# Strip the trailing space, then build one 0/1 column per space-separated token.
df = pharma_data['Treated_with_drugs'].str.strip().str.get_dummies(' ')
print(df)
    DX1  DX2  DX3  DX4  DX5  DX6
0     1    0    0    0    0    0
1     1    1    0    0    0    0
2     1    1    1    0    0    0
3     1    1    1    1    0    0
4     1    1    1    1    1    0
5     1    1    1    0    1    0
6     1    1    0    1    0    0
7     1    1    0    1    1    0
8     1    1    0    0    1    0
9     1    0    1    0    0    0
10    1    0    1    1    0    0
11    1    0    1    1    1    0
12    1    0    1    0    1    0
13    1    0    0    1    0    0
14    1    0    0    1    1    0
15    1    0    0    0    1    0
16    0    1    0    0    0    0
17    0    1    1    0    0    0
18    0    1    1    1    0    0
19    0    1    1    1    1    0
20    0    1    1    0    1    0
21    0    1    0    1    0    0
22    0    1    0    1    1    0
23    0    1    0    0    1    0
24    0    0    1    0    0    0
25    0    0    1    1    0    0
26    0    0    1    1    1    0
27    0    0    1    0    1    0
28    0    0    0    1    0    0
29    0    0    0    1    1    0
30    0    0    0    0    1    0
31    0    0    0    0    0    1
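
If the indicator columns should sit next to the original column, or if some drug might not appear in a given subset of the data, something like the following may help. This is only a sketch: the three-row sample and the names expected_cols, dummies and result are made up for illustration, and it assumes the six drug labels are known in advance.

```python
import pandas as pd

# Small illustrative sample, not the full dataset from the question.
pharma_data = pd.DataFrame(
    {'Treated_with_drugs': ['DX1 DX2 DX5 ', 'DX6', 'DX3 ']}
)

# Assumed: the complete list of drug labels is known up front.
expected_cols = ['DX1', 'DX2', 'DX3', 'DX4', 'DX5', 'DX6']

dummies = (
    pharma_data['Treated_with_drugs']
    .str.strip()
    .str.get_dummies(' ')
    # Add any label missing from this sample (here DX4) as an all-zero column.
    .reindex(columns=expected_cols, fill_value=0)
)

# Attach the indicator columns to the original frame; the indexes match.
result = pharma_data.join(dummies)
print(result)
```

The reindex step keeps the column set stable across different slices of the data, which matters if the result feeds a model that expects all six columns.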