I have a dataframe with event data. It has two columns, primary and secondary, and each contains a list of tags (e.g., ['Fun event', 'Dance party']). A third column, combined, holds both lists joined together.
primary             secondary             combined
['booze', 'party']  ['singing', 'dance']  ['booze', 'party', 'singing', 'dance']
['concert']         ['booze', 'vocals']   ['concert', 'booze', 'vocals']
I want to dummy-code the data so that tags in the primary column are coded 1, tags in the secondary column are coded .5, and tags not observed in either column are coded 0. Like so:
combined                                booze  party  singing  dance  concert  vocals
['booze', 'party', 'singing', 'dance']  1      1      .5       .5     0        0
['concert', 'booze', 'vocals']          .5     0      0        0      1        .5
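# One 0/1 column per tag; groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
df1 = pd.get_dummies(df.combined.apply(pd.Series).stack()).groupby(level=0).sum().astype(float)
# True where the column's tag is in that row's secondary list (selected by name, not by position)
in_secondary = df1.apply(lambda col: df['secondary'].apply(lambda tags: col.name in tags))
df1[in_secondary] -= 0.5
df1
Out[173]:
   booze  concert  dance  party  singing  vocals
0    1.0      0.0    0.5    1.0      0.5     0.0
1    0.5      1.0    0.0    0.0      0.0     0.5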
Data input:
import pandas as pd

df = pd.DataFrame({'primary': [['booze', 'party'], ['concert']],
                   'secondary': [['singing', 'dance'], ['booze', 'vocals']],
                   'combined': [['booze', 'party', 'singing', 'dance'], ['concert', 'booze', 'vocals']]})
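If it helps, here is an alternative sketch that does not need the combined column at all. It dummy-codes primary and secondary separately, scales them by 1 and 0.5, and takes the element-wise maximum. The names weighted_dummies and coded are made up for illustration, and it assumes a pandas version that has Series.explode (0.25 or later). It also assumes that a tag appearing in both lists of the same row should be coded 1; the question does not say how that case should be handled.

```python
import pandas as pd

df = pd.DataFrame({'primary': [['booze', 'party'], ['concert']],
                   'secondary': [['singing', 'dance'], ['booze', 'vocals']]})


def weighted_dummies(tags, weight):
    """Return one row per event and one column per tag, holding `weight` where the tag occurs."""
    exploded = tags.explode()  # one tag per row, original index repeated
    dummies = pd.get_dummies(exploded, dtype=float)
    # Collapse back to one row per event, then scale the 1.0 indicators.
    return dummies.groupby(level=0).max() * weight


primary = weighted_dummies(df['primary'], 1.0)
secondary = weighted_dummies(df['secondary'], 0.5)

# Stack both frames, treat tags missing on one side as 0, and keep the
# larger code per event so a primary tag wins over a secondary one.
coded = (pd.concat([primary, secondary])
           .fillna(0.0)
           .groupby(level=0)
           .max()
           .sort_index(axis=1))

print(coded)
```

Because each column is built from its own list, this version does not depend on column order in df or on a combined column being kept in sync with the other two.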