get_dummies split character


I have labelled data that I need to one-hot encode. The labels are '786.2', 'ICD-9-CM|786.2', 'ICD-9-CM', '786.2b|V13.02', 'V13.02', '279.12', 'ICD-9-CM|V42.81'. The | means that the document has 2 labels at the same time. So I wrote the code like this:

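import numpy as np
import pandas as pd

# In the real code the labels come from: labels = np.asarray(label_docs)
# Sample values for illustration:
labels = np.array([u'786.2', u'ICD-9-CM|786.2', u'|ICD-9-CM', u'786.2b|V13.02', u'V13.02', u'279.12', u'ICD-9-CM|V42.81|'])

df = pd.DataFrame(labels, columns=['label'])
# One indicator column per distinct label; '|' separates multiple labels
labels = df['label'].str.get_dummies(sep='|')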

and the result:

   279.12  786.2  786.2b  ICD-9-CM  V13.02  V42.81
0       0      1       0         0       0       0
1       0      1       0         1       0       0
2       0      0       0         1       0       0
3       0      0       1         0       1       0
4       0      0       0         0       1       0
5       1      0       0         0       0       0
6       0      0       0         1       0       1

However, I now want only one label for each document:

'ICD-9-CM|786.2' should become 'ICD-9-CM',

'ICD-9-CM|V42.81|' should become 'ICD-9-CM'.

How can I separate the labels like that with get_dummies?


Solution

  • I think you need str.strip and str.split, and then select the first item of each list with str[0]:

    print (df.label.str.strip('|').str.split('|').str[0])
    0       786.2
    1    ICD-9-CM
    2    ICD-9-CM
    3      786.2b
    4      V13.02
    5      279.12
    6    ICD-9-CM
    Name: label, dtype: object
    
    labels = df.label.str.strip('|').str.split('|').str[0].str.get_dummies()
    print (labels)
       279.12  786.2  786.2b  ICD-9-CM  V13.02
    0       0      1       0         0       0
    1       0      0       0         1       0
    2       0      0       0         1       0
    3       0      0       1         0       0
    4       0      0       0         0       1
    5       1      0       0         0       0
    6       0      0       0         1       0
    

    If the row with index 2 should get no label (its string starts with |, so its first item is empty), remove str.strip:

    print (df.label.str.split('|').str[0])
    0       786.2
    1    ICD-9-CM
    2            
    3      786.2b
    4      V13.02
    5      279.12
    6    ICD-9-CM
    Name: label, dtype: object
    
    labels = df.label.str.split('|').str[0].str.get_dummies(sep='|')
    print (labels)
    
       279.12  786.2  786.2b  ICD-9-CM  V13.02
    0       0      1       0         0       0
    1       0      0       0         1       0
    2       0      0       0         0       0
    3       0      0       1         0       0
    4       0      0       0         0       1
    5       1      0       0         0       0
    6       0      0       0         1       0
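
The two variants above can be folded into one small helper. This is only a sketch: the function name first_label_dummies and the strip_sep flag are made up for illustration and are not part of pandas. It assumes every label is a plain string and that the first item is the one to keep.

```python
import pandas as pd


def first_label_dummies(labels, sep="|", strip_sep=True):
    """One-hot encode only the first label of each sep-joined string.

    Hypothetical helper, not a pandas API.
    """
    s = pd.Series(labels, name="label")
    if strip_sep:
        # Drop leading/trailing separators so '|ICD-9-CM' yields 'ICD-9-CM'
        s = s.str.strip(sep)
    # Split on the separator and keep the first item of each list
    first = s.str.split(sep).str[0]
    # One indicator column per distinct first label
    return first.str.get_dummies(sep=sep)


labels = ['786.2', 'ICD-9-CM|786.2', '|ICD-9-CM', '786.2b|V13.02',
          'V13.02', '279.12', 'ICD-9-CM|V42.81|']

stripped = first_label_dummies(labels)
unstripped = first_label_dummies(labels, strip_sep=False)

print(stripped)
print(unstripped)
```

With strip_sep=True the row with index 2 is encoded as ICD-9-CM. With strip_sep=False that row stays all zeros, because its first item is an empty string and get_dummies does not create a column for it.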